Here, we will evaluate methods for deriving TWAS SNP-weights based on eQTL summary statistics.
There are two stages to this process:
Existing approaches:
The FUSION and MetaXcan approach is better for determining genes to include in a GWAS as it better reflects the power that TWAS would have. We may not always have a target sample to test the variance explained by the SNP-weights, so the MetaXcan approach is less applicable. So, I think we should use the same criteria as FUSION, except we must use a summary statistic based approach to estimate SNP-based heritability. We should compare summary statistic approaches to estimates from GREML as reported in FUSION released SNP-weights. The approach also needs to be fast as it will need to be implemented for each gene seperately.
SMR uses single eQTL summary statistics, making it applicable to a wider range of datasets and results from meta-analyses. TWAS based methods are not currently relevent when only eQTL summary statistics are available, as the advantage of these methods over SMR is their ability to incorporate the effects of multiple SNPs on gene expression. Currently, TWAS methods use SNP-weights derived using individual level data, which means TWAS cannot use meta-analysis results for eQTLs, and often TWAS weights are unavailable for the latest eQTL datasets. So, here we will compare a range of summary statistic approaches to derive TWAS weights. Again, the method will need to be fast since it will need to be applied to each genes seperately. Summary statistic based polygenic scoring methods will likely be of use, though given the lack of target samples for validation, a pseudovalidation approach will be necessary, which estimates the optimal parameters without a target sample for validation.
Possible methods for deriving SNP-weights:
Comparison to FUSION weights is not ideal as by chance different leads SNP could be selected. Ideally we would compare summary statistic-based and observed TWAS weights to observed expression. However, lets start with this as it will allow comparison of heritability estimates and be a good opportunity to design the study.
We will use whole blood data for the comparison. eQTL summary statistics are downloaded from here. FUSION TWAS SNP-weights were downloaded from here.
We will compare different version of GCTB Bayes models and LDSC. If LDSC doesn’t work, then LDAK model is unlikely to make a difference. To avoid unessecary compuational burden, only estimate hsq for 100 genes on chromosome 22.
Show code
# Split the GTEx eQTL data by chromosome
for chr in $(seq 1 22);do
pattern=$(echo ^${chr}_)
awk -v pat="$pattern" '$2 ~ pat { print }' /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/Whole_Blood.allpairs.txt > /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/Whole_Blood.allpairs.chr${chr}.txt_tmp
cat <(head -n1 /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/Whole_Blood.allpairs.txt) /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/Whole_Blood.allpairs.chr${chr}.txt_tmp > /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/Whole_Blood.allpairs.chr${chr}.txt
rm /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/Whole_Blood.allpairs.chr${chr}.txt_tmp
done
library(data.table)
# Make output directory
dir.create('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/SBayesR')
dir.create('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/SBayesS')
dir.create('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/LDSC')
dir.create('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/SBayesR/RDat_files')
# Read in HapMap3 SNP-list
hm3_snp<-fread('/users/k1806347/brc_scratch/Data/ldsc/w_hm3.snplist')
sbayesr_h2<-NULL
sbayesr_robust_h2<-NULL
sbayess_h2<-NULL
ldsc_h2<-NULL
# Run across each chromosome seperately
for(chr_i in 22){
# Read in eQTL data
eqtl<-fread(paste0('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/Whole_Blood.allpairs.chr',chr_i,'.txt'))
# Create CHR, BP, A1, A2, and Build columns
var_id<-data.table(do.call(rbind, strsplit(eqtl$variant_id, '_')))
names(var_id)<-c('CHR','BP','A1','A2','Build')
var_id$CHR<-as.numeric(var_id$CHR)
var_id$BP<-as.numeric(var_id$BP)
eqtl<-cbind(eqtl, var_id)
# Insert IUPAC codes for SNPs
eqtl$IUPAC[eqtl$A1 == 'A' & eqtl$A2 =='T' | eqtl$A1 == 'T' & eqtl$A2 =='A']<-'W'
eqtl$IUPAC[eqtl$A1 == 'C' & eqtl$A2 =='G' | eqtl$A1 == 'G' & eqtl$A2 =='C']<-'S'
eqtl$IUPAC[eqtl$A1 == 'A' & eqtl$A2 =='G' | eqtl$A1 == 'G' & eqtl$A2 =='A']<-'R'
eqtl$IUPAC[eqtl$A1 == 'C' & eqtl$A2 =='T' | eqtl$A1 == 'T' & eqtl$A2 =='C']<-'Y'
eqtl$IUPAC[eqtl$A1 == 'G' & eqtl$A2 =='T' | eqtl$A1 == 'T' & eqtl$A2 =='G']<-'K'
eqtl$IUPAC[eqtl$A1 == 'A' & eqtl$A2 =='C' | eqtl$A1 == 'C' & eqtl$A2 =='A']<-'M'
# Remove INDELS
eqtl<-eqtl[!is.na(eqtl$IUPAC),]
# Calculate N for each association
eqtl$N<-round((eqtl$ma_count/eqtl$maf)/2,1)
# Check the build
table(eqtl$Build)
# They are all b37
###
# Insert RSIDs
###
# Read in 1KG reference SNP data
ref_bim_i<-fread(paste0('/mnt/lustre/groups/biomarkers-brc-mh/Reference_data/1KG_Phase3/PLINK/all_phase3.chr',chr_i,'.bim'))
names(ref_bim_i)<-c('CHR','SNP','POS','BP','A1','A2')
# Extract hm3 SNPs
ref_bim_i<-ref_bim_i[ref_bim_i$SNP %in% hm3_snp$SNP,]
# Insert IUPAC codes
ref_bim_i$IUPAC[ref_bim_i$A1 == 'A' & ref_bim_i$A2 =='T' | ref_bim_i$A1 == 'T' & ref_bim_i$A2 =='A']<-'W'
ref_bim_i$IUPAC[ref_bim_i$A1 == 'C' & ref_bim_i$A2 =='G' | ref_bim_i$A1 == 'G' & ref_bim_i$A2 =='C']<-'S'
ref_bim_i$IUPAC[ref_bim_i$A1 == 'A' & ref_bim_i$A2 =='G' | ref_bim_i$A1 == 'G' & ref_bim_i$A2 =='A']<-'R'
ref_bim_i$IUPAC[ref_bim_i$A1 == 'C' & ref_bim_i$A2 =='T' | ref_bim_i$A1 == 'T' & ref_bim_i$A2 =='C']<-'Y'
ref_bim_i$IUPAC[ref_bim_i$A1 == 'G' & ref_bim_i$A2 =='T' | ref_bim_i$A1 == 'T' & ref_bim_i$A2 =='G']<-'K'
ref_bim_i$IUPAC[ref_bim_i$A1 == 'A' & ref_bim_i$A2 =='C' | ref_bim_i$A1 == 'C' & ref_bim_i$A2 =='A']<-'M'
# Merge target and reference based on position
ref_target<-merge(eqtl, ref_bim_i, by=c('CHR','BP'))
# I have used the GTEx reference data as well, and the same number is returned.
# Note: The A1 allele is the ref allele.
# Small number of SNPs remaining highlights the limitation of using HapMap3 SNPs only.
# Identify SNPs for which alleles need to be flipped
flip_tmp<-ref_target[(ref_target$IUPAC.x == 'R' & ref_target$IUPAC.y == 'Y' |
ref_target$IUPAC.x == 'Y' & ref_target$IUPAC.y == 'R' |
ref_target$IUPAC.x == 'K' & ref_target$IUPAC.y == 'M' |
ref_target$IUPAC.x == 'M' & ref_target$IUPAC.y == 'K'),]
# Idenitfy SNPs which match the reference alleles
incl<-ref_target[ ref_target$IUPAC.x == 'R' & ref_target$IUPAC.y == 'R' |
ref_target$IUPAC.x == 'Y' & ref_target$IUPAC.y == 'Y' |
ref_target$IUPAC.x == 'K' & ref_target$IUPAC.y == 'K' |
ref_target$IUPAC.x == 'M' & ref_target$IUPAC.y == 'M' ]
# If a SNP that needs to be flipped has a duplicate that is on the correct strand, remove it.
flip<-flip_tmp[!(flip_tmp$SNP %in% incl$SNP)]
# Flip alleles where necessary
flip_tmp$A1_flipped<-flip_tmp$A1.x
flip_tmp$A1_flipped[flip_tmp$A1.x == 'A']<-'T'
flip_tmp$A1_flipped[flip_tmp$A1.x == 'T']<-'C'
flip_tmp$A1_flipped[flip_tmp$A1.x == 'G']<-'C'
flip_tmp$A1_flipped[flip_tmp$A1.x == 'C']<-'G'
flip_tmp$A1.x<-flip_tmp$A1_flipped
flip_tmp$A1_flipped<-NULL
flip_tmp$A2_flipped<-flip_tmp$A2.x
flip_tmp$A2_flipped[flip_tmp$A2.x == 'A']<-'T'
flip_tmp$A2_flipped[flip_tmp$A2.x == 'T']<-'C'
flip_tmp$A2_flipped[flip_tmp$A2.x == 'G']<-'C'
flip_tmp$A2_flipped[flip_tmp$A2.x == 'C']<-'G'
flip_tmp$A2.x<-flip_tmp$A2_flipped
flip_tmp$A2_flipped<-NULL
# Recombine and format
eqtl_harm<-rbind(flip_tmp, incl)
eqtl_harm<-eqtl_harm[,c('CHR','SNP','BP','A1.x','A2.x',"gene_id","variant_id","tss_distance","ma_samples","ma_count","maf","pval_nominal","slope","slope_se","N"), with=F]
names(eqtl_harm)[names(eqtl_harm) == 'A1.x']<-'A1'
names(eqtl_harm)[names(eqtl_harm) == 'A2.x']<-'A2'
# Remove variants with MAF < 1%
eqtl_harm<-eqtl_harm[eqtl_harm$maf >= 0.01,]
eqtl<-eqtl_harm
rm(eqtl_harm)
# Identify unique genes
genes<-unique(eqtl$gene_id)
# Subset the first 100 genes
genes<-genes[1:100]
# Run for each gene seperately
for(gene_i in genes){
print(which(genes == gene_i))
eqtl_gene_i<-eqtl[eqtl$gene_id == gene_i,]
#######
# SBayesR
#######
# Format for SBayesR
eqtl_gene_i_sbayesr<-eqtl_gene_i[,c('SNP','A1','A2','maf','slope','slope_se','pval_nominal','N'), with=F]
names(eqtl_gene_i_sbayesr)<-c('SNP','A1','A2','freq','b','se','p','N')
# write in SBayesR format
fwrite(eqtl_gene_i_sbayesr, paste0('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/SBayesR/',gene_i,'.txt'), sep=' ', na = "NA", quote=F)
# Run SBayesR
log<-system(paste0('/users/k1806347/brc_scratch/Software/gctb_2.03beta_Linux/gctb_203.sh --sbayes R --ldm /scratch/groups/ukbiobank/usr/ollie_pain/GenoPredPipe/GenoPred/GenoPredPipe/resources/data/gctb_ref/ukbEURu_hm3_shrunk_sparse/ukbEURu_hm3_v3_50k_chr',chr_i,'.ldm.sparse --pi 0.95,0.02,0.02,0.01 --gamma 0.0,0.01,0.1,1 --gwas-summary /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/SBayesR/',gene_i,'.txt --chain-length 10000 --exclude-mhc --burn-in 2000 --out-freq 1000 --out /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/SBayesR/',gene_i,'.SBayesR'), intern=T)
# Run SBayesR
log2<-system(paste0('/users/k1806347/brc_scratch/Software/gctb_2.03beta_Linux/gctb_203.sh --sbayes R --ldm /scratch/groups/ukbiobank/usr/ollie_pain/GenoPredPipe/GenoPred/GenoPredPipe/resources/data/gctb_ref/ukbEURu_hm3_shrunk_sparse/ukbEURu_hm3_v3_50k_chr',chr_i,'.ldm.sparse --pi 0.95,0.02,0.02,0.01 --gamma 0.0,0.01,0.1,1 --gwas-summary /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/SBayesR/',gene_i,'.txt --chain-length 10000 --robust --exclude-mhc --burn-in 2000 --out-freq 1000 --out /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/SBayesR/',gene_i,'.SBayesR_robust'), intern=T)
#######
# SBayesS
#######
# Run SBayesS
log3<-system(paste0('/users/k1806347/brc_scratch/Software/gctb_2.03beta_Linux/gctb_203.sh --sbayes S --ldm /scratch/groups/ukbiobank/usr/ollie_pain/GenoPredPipe/GenoPred/GenoPredPipe/resources/data/gctb_ref/ukbEURu_hm3_shrunk_sparse/ukbEURu_hm3_v3_50k_chr',chr_i,'.ldm.sparse --pi 0.1 --hsq 0.5 --chain-length 25000 --burn-in 5000 --gwas-summary /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/SBayesR/',gene_i,'.txt --exclude-mhc --out-freq 1000 --out /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/SBayesS/',gene_i,'.SBayesS'), intern=T)
#######
# LDSC
#######
# Format for LDSC
eqtl_gene_i_ldsc<-eqtl_gene_i[,c('SNP','A1','A2','slope','slope_se','N'), with=F]
names(eqtl_gene_i_ldsc)<-c('SNP','A1','A2','BETA','SE','N')
eqtl_gene_i_ldsc$Z<-eqtl_gene_i_ldsc$BETA/eqtl_gene_i_ldsc$SE
eqtl_gene_i_ldsc<-eqtl_gene_i_ldsc[,c('SNP','A1','A2','Z','N'), with=F]
# write in LDSC format
fwrite(eqtl_gene_i_ldsc, paste0('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/LDSC/',gene_i,'.txt'), sep=' ', na = "NA", quote=F)
# Run LDSC
log4<-system(paste0('/users/k1806347/brc_scratch/Software/ldsc.sh --h2 /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/LDSC/',gene_i,'.txt --ref-ld-chr /users/k1806347/brc_scratch/Data/ldsc/eur_w_ld_chr/ --w-ld-chr /users/k1806347/brc_scratch/Data/ldsc/eur_w_ld_chr/ --out /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/LDSC/',gene_i,'.h2'), intern=T)
########
# Tabulate SNP-h2 results
########
# Read SbayesR heritability result
if(file.exists(paste0('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/SBayesR/',gene_i,'.SBayesR.parRes'))){
par_res_file_i<-fread(paste0('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/SBayesR/',gene_i,'.SBayesR.parRes'))
par_res_file_i<-par_res_file_i[par_res_file_i$V1 == 'hsq',2:3, with=F]
par_res_file_i$P<-pnorm(-abs(par_res_file_i$Mean/par_res_file_i$SD))
par_res_file_i$Gene<-gene_i
par_res_file_i<-par_res_file_i[,c('Gene','Mean','SD','P'), with=F]
names(par_res_file_i)<-c('gene','hsq','se','p')
sbayesr_h2<-rbind(sbayesr_h2, par_res_file_i)
}
if(file.exists(paste0('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/SBayesR/',gene_i,'.SBayesR_robust.parRes'))){
par_res_file_i<-fread(paste0('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/SBayesR/',gene_i,'.SBayesR_robust.parRes'))
par_res_file_i<-par_res_file_i[par_res_file_i$V1 == 'hsq',2:3, with=F]
par_res_file_i$P<-pnorm(-abs(par_res_file_i$Mean/par_res_file_i$SD))
par_res_file_i$Gene<-gene_i
par_res_file_i<-par_res_file_i[,c('Gene','Mean','SD','P'), with=F]
names(par_res_file_i)<-c('gene','hsq','se','p')
sbayesr_robust_h2<-rbind(sbayesr_robust_h2, par_res_file_i)
}
# Read SbayesS heritability result
if(file.exists(paste0('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/SBayesS/',gene_i,'.SBayesS.parRes'))){
par_res_file_i<-fread(paste0('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/SBayesS/',gene_i,'.SBayesS.parRes'))
par_res_file_i<-par_res_file_i[par_res_file_i$V1 == 'hsq',2:3, with=F]
par_res_file_i$P<-pnorm(-abs(par_res_file_i$Mean/par_res_file_i$SD))
par_res_file_i$Gene<-gene_i
par_res_file_i<-par_res_file_i[,c('Gene','Mean','SD','P'), with=F]
names(par_res_file_i)<-c('gene','hsq','se','p')
sbayess_h2<-rbind(sbayess_h2, par_res_file_i)
}
# Read LDSC heritability result
if(file.exists(paste0('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/LDSC/',gene_i,'.h2.log'))){
par_res_file_i<-readLines(paste0('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/LDSC/',gene_i,'.h2.log'))
par_res_file_i<-gsub('.*: ','',par_res_file_i[grepl('Total Observed scale h2',par_res_file_i)])
hsq<-as.numeric(gsub(' .*','',par_res_file_i))
se<-as.numeric(gsub(")",'',gsub(".*\\(",'',par_res_file_i)))
p<-pnorm(hsq/se, lower.tail = F)
ldsc_h2<-rbind(ldsc_h2, data.frame(gene=gene_i,
hsq=hsq,
se=se,
p=p))
}
}
}
# Compare hsq estimates across methods
fusion_files<-list.files(path='/scratch/groups/biomarkers-brc-mh/TWAS_resource/FUSION/SNP-weights/Whole_Blood/Whole_Blood/', pattern='.RDat')
fusion_files<-gsub('.wgt.RDat','',gsub('Whole_Blood.','',fusion_files))
fusion_files<-intersect(fusion_files, genes)
greml_h2<-NULL
for(genes_i in fusion_files){
load(paste0('/scratch/groups/biomarkers-brc-mh/TWAS_resource/FUSION/SNP-weights/Whole_Blood/Whole_Blood/Whole_Blood.',genes_i,'.wgt.RDat'))
fusion<-data.frame(wgt.matrix)
fusion$SNP<-row.names(fusion)
greml<-c(hsq, hsq.pv)
greml_h2<-rbind(greml_h2, data.frame(gene=genes_i,
hsq=greml[1],
se=greml[2],
p=greml[3]))
}
# Combine results across methods
sbayesr_h2$method<-'sbayesr'
sbayesr_robust_h2$method<-'sbayesr_robust'
sbayess_h2$method<-'sbayess'
ldsc_h2$method<-'ldsc'
greml_h2$method<-'greml'
all_h2<-do.call(rbind, list(sbayesr_h2,
sbayesr_robust_h2,
sbayess_h2,
ldsc_h2,
greml_h2))
write.table(all_h2, '/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/h2_estimates.txt', col.names=T, row.names=F, quote=F)
# Plot the results
library(data.table)
all_h2<-fread('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/h2_estimates.txt')
library(UpSetR)
# Remove estimates which did not converge
all_h2<-all_h2[all_h2$se != 0,]
# Merge estimates an check correlation
all_h2_full_list<-split(all_h2[,c('gene','hsq'),with=F], f = all_h2$method)
for(i in names(all_h2_full_list)){
names(all_h2_full_list[[which(names(all_h2_full_list) == i)]])<-c('gene',i)
}
all_h2_full_dat<-Reduce(function(...) merge(..., by='gene', all=T), all_h2_full_list)
library("ggplot2")
library("GGally")
lowerfun <- function(data,mapping){
ggplot(data = data, mapping = mapping)+
geom_point() +
geom_abline(intercept =0 , slope = 1, colour='red')
}
png('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/h2_estimates.png', units = 'px', res=300, width=2000, height=2000)
ggpairs(all_h2_full_dat[,-1], lower = list(continuous = wrap(lowerfun)))
dev.off()
# Note. LDSC estimates are very inaccurate and do not correlated well with other methods. SBayesR correlates best wit GREML, and SBayesR and SBayesR with robust setting are highly correlated. The SBayesS method correlates less well with other methods, but this doesn't mean it is inaccurate as we don't know the truth.
# Check the number of genes that are significant in each method and how they overlap across methods.
# Extract significant estimates
all_h2_sig<-all_h2[all_h2$p < 0.01,]
n_gene<-data.frame(table(all_h2_sig$method))
names(n_gene)<-c('Method','N_genes')
# The number is similar across methods, but SBayesR has the highest.
all_h2_sig_list<-split(all_h2_sig[['gene']], f = all_h2_sig$method)
# Plot number of genes with valid and significant estimates
png('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/n_sig_hsq.png', units = 'px', res=300, width=1500, height=1000)
upset(fromList(all_h2_sig_list), nsets=10, order.by = "freq")
dev.off()
# Again LDSC is least concordant with other methods.
# Compare number overlapping with GREML list
all_h2_greml_sig_list<-all_h2_sig_list
for(i in names(all_h2_greml_sig_list)){
tmp<-all_h2_greml_sig_list[[which(names(all_h2_greml_sig_list) == i)]]
tmp<-tmp[tmp %in% all_h2_greml_sig_list[['greml']]]
all_h2_greml_sig_list[[which(names(all_h2_greml_sig_list) == i)]]<-tmp
}
n_gene$GREML_overlap<-unlist(lapply(all_h2_greml_sig_list, length))
write.csv(n_gene, '/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/n_sig_hsq.csv', row.names=F)
# Note. SBayesR has the largest overlap with GREML. Given this, and that SBayesR finds the most significant genes, I think we should use SBayesR without robust parameteristation to estimate heritability.
Show SNP-h2 estimates across methods
LDSC estimates are very inaccurate and do not correlated well with other methods. SBayesR correlates best wit GREML, and SBayesR and SBayesR with robust setting are highly correlated. The SBayesS method correlates less well with other methods, but this doesn’t mean it is inaccurate as we don’t know the truth.
Show number of genes with significant SNP-h2
| Method | N_genes | GREML_overlap |
|---|---|---|
| greml | 39 | 39 |
| ldsc | 30 | 21 |
| sbayesr | 36 | 33 |
| sbayesr_robust | 33 | 27 |
| sbayess | 35 | 27 |
The number is similar across methods, but SBayesR has the highest.
SBayesR has the largest overlap with GREML. Given this, and that SBayesR finds the most significant genes, I think we should use SBayesR without robust parameteristation to estimate heritability.
Again LDSC is least concordant with other methods.
Now we have decided which genes to create SNP-weights for, we will generate SNP-weights for predicting gene expression using a range of methods. Then we will compare these SNP-weights to those derived by FUSION. Again, to avoid computational burden, start by running for the first 100 genes on chromosome 22.
Show code
library(data.table)
# Make output directory
dir.create('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/snp_weight_comp')
dir.create('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/snp_weight_comp/SBayesR')
dir.create('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/snp_weight_comp/SBayesR_robust')
dir.create('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/snp_weight_comp/DBSLMM')
dir.create('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/snp_weight_comp/PRScs')
dir.create('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/snp_weight_comp/SuSiE')
dir.create('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/snp_weight_comp/RDat_files')
# Read in HapMap3 SNP-list
hm3_snp<-fread('/users/k1806347/brc_scratch/Data/ldsc/w_hm3.snplist')
sbayesr_h2<-NULL
# Run across each chromosome seperately
for(chr_i in 22){
# Read in eQTL data
eqtl<-fread(paste0('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/Whole_Blood.allpairs.chr',chr_i,'.txt'))
# Create CHR, BP, A1, A2, and Build columns
var_id<-data.table(do.call(rbind, strsplit(eqtl$variant_id, '_')))
names(var_id)<-c('CHR','BP','A1','A2','Build')
var_id$CHR<-as.numeric(var_id$CHR)
var_id$BP<-as.numeric(var_id$BP)
eqtl<-cbind(eqtl, var_id)
# Insert IUPAC codes for SNPs
eqtl$IUPAC[eqtl$A1 == 'A' & eqtl$A2 =='T' | eqtl$A1 == 'T' & eqtl$A2 =='A']<-'W'
eqtl$IUPAC[eqtl$A1 == 'C' & eqtl$A2 =='G' | eqtl$A1 == 'G' & eqtl$A2 =='C']<-'S'
eqtl$IUPAC[eqtl$A1 == 'A' & eqtl$A2 =='G' | eqtl$A1 == 'G' & eqtl$A2 =='A']<-'R'
eqtl$IUPAC[eqtl$A1 == 'C' & eqtl$A2 =='T' | eqtl$A1 == 'T' & eqtl$A2 =='C']<-'Y'
eqtl$IUPAC[eqtl$A1 == 'G' & eqtl$A2 =='T' | eqtl$A1 == 'T' & eqtl$A2 =='G']<-'K'
eqtl$IUPAC[eqtl$A1 == 'A' & eqtl$A2 =='C' | eqtl$A1 == 'C' & eqtl$A2 =='A']<-'M'
# Remove INDELS
eqtl<-eqtl[!is.na(eqtl$IUPAC),]
# Calculate N for each association
eqtl$N<-round((eqtl$ma_count/eqtl$maf)/2,1)
# Check the build
table(eqtl$Build)
# They are all b37
###
# Insert RSIDs
###
# Read in 1KG reference SNP data
ref_bim_i<-fread(paste0('/mnt/lustre/groups/biomarkers-brc-mh/Reference_data/1KG_Phase3/PLINK/all_phase3.chr',chr_i,'.bim'))
names(ref_bim_i)<-c('CHR','SNP','POS','BP','A1','A2')
# Extract hm3 SNPs
ref_bim_i<-ref_bim_i[ref_bim_i$SNP %in% hm3_snp$SNP,]
# Insert IUPAC codes
ref_bim_i$IUPAC[ref_bim_i$A1 == 'A' & ref_bim_i$A2 =='T' | ref_bim_i$A1 == 'T' & ref_bim_i$A2 =='A']<-'W'
ref_bim_i$IUPAC[ref_bim_i$A1 == 'C' & ref_bim_i$A2 =='G' | ref_bim_i$A1 == 'G' & ref_bim_i$A2 =='C']<-'S'
ref_bim_i$IUPAC[ref_bim_i$A1 == 'A' & ref_bim_i$A2 =='G' | ref_bim_i$A1 == 'G' & ref_bim_i$A2 =='A']<-'R'
ref_bim_i$IUPAC[ref_bim_i$A1 == 'C' & ref_bim_i$A2 =='T' | ref_bim_i$A1 == 'T' & ref_bim_i$A2 =='C']<-'Y'
ref_bim_i$IUPAC[ref_bim_i$A1 == 'G' & ref_bim_i$A2 =='T' | ref_bim_i$A1 == 'T' & ref_bim_i$A2 =='G']<-'K'
ref_bim_i$IUPAC[ref_bim_i$A1 == 'A' & ref_bim_i$A2 =='C' | ref_bim_i$A1 == 'C' & ref_bim_i$A2 =='A']<-'M'
# Merge target and reference based on position
ref_target<-merge(eqtl, ref_bim_i, by=c('CHR','BP'))
# I have used the GTEx reference data as well, and the same number is returned.
# Note: The A1 allele is the ref allele.
# Small number of SNPs remaining highlights the limitation of using HapMap3 SNPs only.
# Identify SNPs for which alleles need to be flipped
flip_tmp<-ref_target[(ref_target$IUPAC.x == 'R' & ref_target$IUPAC.y == 'Y' |
ref_target$IUPAC.x == 'Y' & ref_target$IUPAC.y == 'R' |
ref_target$IUPAC.x == 'K' & ref_target$IUPAC.y == 'M' |
ref_target$IUPAC.x == 'M' & ref_target$IUPAC.y == 'K'),]
# Idenitfy SNPs which match the reference alleles
incl<-ref_target[ ref_target$IUPAC.x == 'R' & ref_target$IUPAC.y == 'R' |
ref_target$IUPAC.x == 'Y' & ref_target$IUPAC.y == 'Y' |
ref_target$IUPAC.x == 'K' & ref_target$IUPAC.y == 'K' |
ref_target$IUPAC.x == 'M' & ref_target$IUPAC.y == 'M' ]
# If a SNP that needs to be flipped has a duplicate that is on the correct strand, remove it.
flip<-flip_tmp[!(flip_tmp$SNP %in% incl$SNP)]
# Flip alleles where necessary
flip_tmp$A1_flipped<-flip_tmp$A1.x
flip_tmp$A1_flipped[flip_tmp$A1.x == 'A']<-'T'
flip_tmp$A1_flipped[flip_tmp$A1.x == 'T']<-'C'
flip_tmp$A1_flipped[flip_tmp$A1.x == 'G']<-'C'
flip_tmp$A1_flipped[flip_tmp$A1.x == 'C']<-'G'
flip_tmp$A1.x<-flip_tmp$A1_flipped
flip_tmp$A1_flipped<-NULL
flip_tmp$A2_flipped<-flip_tmp$A2.x
flip_tmp$A2_flipped[flip_tmp$A2.x == 'A']<-'T'
flip_tmp$A2_flipped[flip_tmp$A2.x == 'T']<-'C'
flip_tmp$A2_flipped[flip_tmp$A2.x == 'G']<-'C'
flip_tmp$A2_flipped[flip_tmp$A2.x == 'C']<-'G'
flip_tmp$A2.x<-flip_tmp$A2_flipped
flip_tmp$A2_flipped<-NULL
# Recombine and format
eqtl_harm<-rbind(flip_tmp, incl)
eqtl_harm<-eqtl_harm[,c('CHR','SNP','BP','A1.x','A2.x',"gene_id","variant_id","tss_distance","ma_samples","ma_count","maf","pval_nominal","slope","slope_se","N"), with=F]
names(eqtl_harm)[names(eqtl_harm) == 'A1.x']<-'A1'
names(eqtl_harm)[names(eqtl_harm) == 'A2.x']<-'A2'
# Remove variants with MAF < 1%
eqtl_harm<-eqtl_harm[eqtl_harm$maf >= 0.01,]
eqtl<-eqtl_harm
rm(eqtl_harm)
# Identify unique genes
genes<-unique(eqtl$gene_id)
genes<-genes[1:100]
# Run for each gene seperately
for(gene_i in genes){
print(which(genes == gene_i))
eqtl_gene_i<-eqtl[eqtl$gene_id == gene_i,]
# Sort by chromosome and bp
eqtl_gene_i<-eqtl_gene_i[order(eqtl_gene_i$CHR, eqtl_gene_i$BP),]
# Filter SNPs to those with N > 80% of max(N)
eqtl_gene_i<-eqtl_gene_i[eqtl_gene_i$N >= 0.8*max(eqtl_gene_i$N),]
#######
# SBayesR
#######
# Format for SBayesR
eqtl_gene_i_sbayesr<-eqtl_gene_i[,c('SNP','A1','A2','maf','slope','slope_se','pval_nominal','N'), with=F]
names(eqtl_gene_i_sbayesr)<-c('SNP','A1','A2','freq','b','se','p','N')
# write in SBayesR format
fwrite(eqtl_gene_i_sbayesr, paste0('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/snp_weight_comp/SBayesR/',gene_i,'.txt'), sep=' ', na = "NA", quote=F)
# Run SBayesR
log<-system(paste0('/users/k1806347/brc_scratch/Software/gctb_2.03beta_Linux/gctb_203.sh --sbayes R --ldm /scratch/groups/ukbiobank/usr/ollie_pain/GenoPredPipe/GenoPred/GenoPredPipe/resources/data/gctb_ref/ukbEURu_hm3_shrunk_sparse/ukbEURu_hm3_v3_50k_chr',chr_i,'.ldm.sparse --pi 0.95,0.02,0.02,0.01 --gamma 0.0,0.01,0.1,1 --gwas-summary /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/snp_weight_comp/SBayesR/',gene_i,'.txt --chain-length 10000 --exclude-mhc --burn-in 2000 --out-freq 1000 --out /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/snp_weight_comp/SBayesR/',gene_i,'.SBayesR'), intern=T)
# Read SbayesR heritability result
if(file.exists(paste0('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/snp_weight_comp/SBayesR/',gene_i,'.SBayesR.parRes'))){
par_res_file_i<-fread(paste0('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/snp_weight_comp/SBayesR/',gene_i,'.SBayesR.parRes'))
par_res_file_i<-par_res_file_i[par_res_file_i$V1 == 'hsq',2:3, with=F]
par_res_file_i$P<-pnorm(-abs(par_res_file_i$Mean/par_res_file_i$SD))
par_res_file_i$Gene<-gene_i
par_res_file_i<-par_res_file_i[,c('Gene','Mean','SD','P'), with=F]
names(par_res_file_i)<-c('gene','hsq','se','p')
sbayesr_h2<-rbind(sbayesr_h2, par_res_file_i)
# If SNP-h2 p < 0.01, generate SNP-weights
if(par_res_file_i$p < 0.01){
# Flip the effect of each method to match eqtl sumstats
ref_tmp<-eqtl_gene_i[, c('SNP','A1','A2'), with=F]
#############
# SBayesR
#############
# SBayesR has already been run, so just read in the SNP-weights
sbayesr_score<-fread(paste0('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/snp_weight_comp/SBayesR/',gene_i,'.SBayesR.snpRes'))
sbayesr_score<-sbayesr_score[,c('Name','A1','A2','A1Effect'), with=F]
names(sbayesr_score)<-c('SNP','A1','A2','BETA')
# Flip effects so allele match eQTL sumstats
sbayesr_score<-merge(ref_tmp, sbayesr_score, by='SNP', all=T)
sbayesr_score$BETA[which(sbayesr_score$A1.x == sbayesr_score$A2.y)]<--sbayesr_score$BETA[which(sbayesr_score$A1.x == sbayesr_score$A2.y)]
sbayesr_score<-sbayesr_score[,c('SNP','A1.x','A2.x','BETA'), with=F]
names(sbayesr_score)<-c('SNP','A1','A2','BETA')
# Sort score file according eqtl_gene_i
sbayesr_score<-sbayesr_score[match(eqtl_gene_i$SNP, sbayesr_score$SNP),]
#############
# SBayesR robust
#############
# SBayesR format sumstats are already available
# So just run SBayesR with robust setting
log<-system(paste0('/users/k1806347/brc_scratch/Software/gctb_2.03beta_Linux/gctb_203.sh --sbayes R --ldm /scratch/groups/ukbiobank/usr/ollie_pain/GenoPredPipe/GenoPred/GenoPredPipe/resources/data/gctb_ref/ukbEURu_hm3_shrunk_sparse/ukbEURu_hm3_v3_50k_chr',chr_i,'.ldm.sparse --pi 0.95,0.02,0.02,0.01 --gamma 0.0,0.01,0.1,1 --gwas-summary /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/snp_weight_comp/SBayesR/',gene_i,'.txt --robust --chain-length 10000 --exclude-mhc --burn-in 2000 --out-freq 1000 --out /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/snp_weight_comp/SBayesR_robust/',gene_i,'.SBayesR'), intern=T)
# Read in the results
sbayesr_robust_score<-fread(paste0('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/snp_weight_comp/SBayesR_robust/',gene_i,'.SBayesR.snpRes'))
sbayesr_robust_score<-sbayesr_robust_score[,c('Name','A1','A2','A1Effect'), with=F]
names(sbayesr_robust_score)<-c('SNP','A1','A2','BETA')
# Flip effects so allele match eQTL sumstats
sbayesr_robust_score<-merge(ref_tmp, sbayesr_robust_score, by='SNP', all=T)
sbayesr_robust_score$BETA[which(sbayesr_robust_score$A1.x == sbayesr_robust_score$A2.y)]<--sbayesr_robust_score$BETA[which(sbayesr_robust_score$A1.x == sbayesr_robust_score$A2.y)]
sbayesr_robust_score<-sbayesr_robust_score[,c('SNP','A1.x','A2.x','BETA'), with=F]
names(sbayesr_robust_score)<-c('SNP','A1','A2','BETA')
# Sort score file according eqtl_gene_i
sbayesr_robust_score<-sbayesr_robust_score[match(eqtl_gene_i$SNP, sbayesr_robust_score$SNP),]
#############
# DBSLMM
#############
# Convert to GEMMA format
eqtl_gene_i_dbslmm<-eqtl_gene_i
eqtl_gene_i_dbslmm$N_MISS<-max(eqtl_gene_i_dbslmm$N)-eqtl_gene_i_dbslmm$N
eqtl_gene_i_dbslmm<-eqtl_gene_i_dbslmm[,c('CHR','SNP','BP','N_MISS','N','A1','A2','maf','slope','slope_se','pval_nominal'),with=F]
names(eqtl_gene_i_dbslmm)<-c('chr','rs','ps','n_mis','n_obs','allele1','allele0','af','beta','se','p_wald')
# Match allele1 and 0 with A1 and 2 in reference (DBSLMM calls this allele discrepancy)
ref_bim_subset<-fread(paste0('/scratch/groups/biomarkers-brc-mh/Reference_data/1KG_Phase3/PLINK/EUR/EUR_phase3.MAF_001.chr',chr_i,'.bim'))
GWAS_match<-merge(eqtl_gene_i_dbslmm, ref_bim_subset[,c('V2','V5','V6'),with=F], by.x=c('rs','allele1','allele0'), by.y=c('V2','V5','V6'))
GWAS_switch<-merge(eqtl_gene_i_dbslmm, ref_bim_subset[,c('V2','V5','V6'),with=F], by.x=c('rs','allele1','allele0'), by.y=c('V2','V6','V5'))
GWAS_switch$allele_tmp<-GWAS_switch$allele0
GWAS_switch$allele0<-GWAS_switch$allele1
GWAS_switch$allele1<-GWAS_switch$allele_tmp
GWAS_switch$allele_tmp<-NULL
GWAS_switch$beta<--GWAS_switch$beta
GWAS_switch$af<-1-GWAS_switch$af
GWAS<-rbind(GWAS_match, GWAS_switch)
GWAS<-GWAS[order(GWAS$chr, GWAS$ps),]
GWAS<-GWAS[,c('chr','rs','ps','n_mis','n_obs','allele1','allele0','af','beta','se','p_wald'),with=F]
GWAS_N<-mean(GWAS$n_obs)
nsnp<-nrow(GWAS)
# Write out formatted sumstats
fwrite(GWAS, paste0('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/snp_weight_comp/DBSLMM/',gene_i,'.DBSLMM'), sep='\t', col.names=F)
# Run dbslmm
system('chmod 777 /users/k1806347/brc_scratch/Software/DBSLMM/software/dbslmm')
system(paste0('/users/k1806347/brc_scratch/Software/Rscript.sh /users/k1806347/brc_scratch/Software/DBSLMM/software/DBSLMM.R --plink /users/k1806347/brc_scratch/Software/plink1.9.sh --block /users/k1806347/brc_scratch/Data/LDetect/EUR/fourier_ls-chr',chr_i,'.bed --dbslmm /users/k1806347/brc_scratch/Software/DBSLMM/software/dbslmm --h2 ',par_res_file_i$hsq,' --ref /scratch/groups/biomarkers-brc-mh/Reference_data/1KG_Phase3/PLINK/EUR/EUR_phase3.MAF_001.chr',chr_i,' --summary /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/snp_weight_comp/DBSLMM/',gene_i,'.DBSLMM --n ',round(GWAS_N,0),' --nsnp ',nsnp,' --outPath /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/snp_weight_comp/DBSLMM/ --thread 1'))
# Read in the results
dbslmm_score<-fread(paste0('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/snp_weight_comp/DBSLMM/',gsub('\\..*','',gene_i),'.dbslmm.txt'))
dbslmm_score<-dbslmm_score[,c(1,2,4), with=T]
names(dbslmm_score)<-c('SNP','A1','BETA')
# Flip effects so allele match eQTL sumstats
dbslmm_score<-merge(ref_tmp, dbslmm_score, by='SNP', all=T)
dbslmm_score$BETA[which(dbslmm_score$A1.x != dbslmm_score$A1.y)]<--dbslmm_score$BETA[which(dbslmm_score$A1.x != dbslmm_score$A1.y)]
dbslmm_score<-dbslmm_score[,c('SNP','A1.x','A2','BETA'), with=F]
names(dbslmm_score)<-c('SNP','A1','A2','BETA')
# Sort score file according eqtl_gene_i
dbslmm_score<-dbslmm_score[match(eqtl_gene_i$SNP, dbslmm_score$SNP),]
##############
# PRScs
##############
# Format for PRScs
eqtl_gene_i_prscs<-eqtl_gene_i[,c('SNP','A1','A2','slope','pval_nominal'), with=F]
names(eqtl_gene_i_prscs)<-c('SNP','A1','A2','BETA','P')
# write in PRScs format
fwrite(eqtl_gene_i_prscs, paste0('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/snp_weight_comp/PRScs/',gene_i,'.txt'), sep=' ', na = "NA", quote=F)
system(paste0('/users/k1806347/brc_scratch/Software/PRScs.sh --ref_dir=/users/k1806347/brc_scratch/Software/PRScs/ldblk_1kg_eur --bim_prefix=/scratch/groups/biomarkers-brc-mh/Reference_data/1KG_Phase3/PLINK/EUR/EUR_phase3.MAF_001.chr',chr_i,' --sst_file=/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/snp_weight_comp/PRScs/',gene_i,'.txt --n_gwas=',round(GWAS_N,0),' --out_dir=/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/snp_weight_comp/PRScs/',gene_i,' --chrom=',chr_i))
# Read in the results
prscs_score<-fread(paste0('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/snp_weight_comp/PRScs/',gene_i,'_pst_eff_a1_b0.5_phiauto_chr22.txt'))
prscs_score<-prscs_score[,c('V2','V4','V6'), with=F]
names(prscs_score)<-c('SNP','A1','BETA')
# Flip effects so allele match eQTL sumstats
prscs_score<-merge(ref_tmp, prscs_score, by='SNP', all=T)
prscs_score$BETA[which(prscs_score$A1.x != prscs_score$A1.y)]<--prscs_score$BETA[which(prscs_score$A1.x != prscs_score$A1.y)]
prscs_score<-prscs_score[,c('SNP','A1.x','A2','BETA'), with=F]
names(prscs_score)<-c('SNP','A1','A2','BETA')
# Sort score file according eqtl_gene_i
prscs_score<-prscs_score[match(eqtl_gene_i$SNP, prscs_score$SNP),]
# Note PRScs takes a lot longer than other methods.
################
# SuSiE finemapping
################
# Read LD estimates for eQTL sumstats
write.table(eqtl_gene_i$SNP, paste0('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/snp_weight_comp/SuSiE/',gene_i,'_snps.txt'), col.names=F, row.names=F, quote=F)
system(paste0('/users/k1806347/brc_scratch/Software/plink1.9.sh --bfile /scratch/groups/biomarkers-brc-mh/Reference_data/1KG_Phase3/PLINK/EUR/EUR_phase3.MAF_001.chr',chr_i,' --extract /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/snp_weight_comp/SuSiE/',gene_i,'_snps.txt --r square --out /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/snp_weight_comp/SuSiE/',gene_i))
ld<-as.matrix(fread(paste0('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/snp_weight_comp/SuSiE/',gene_i,'.ld')))
library(susieR)
tryCatch(fitted_rss <- susie_rss(eqtl_gene_i$slope/eqtl_gene_i$slope_se, ld, L = 10), error = function(e){skip_to_next <<- TRUE})
skip_to_next<-F
if(skip_to_next == TRUE){
susie_score<-data.table(SNP=eqtl_gene_i$SNP,
A1=eqtl_gene_i$A1,
BETA=NA)
} else {
susie_score<-data.table(SNP=eqtl_gene_i$SNP,
A1=eqtl_gene_i$A1,
BETA=eqtl_gene_i$slope*fitted_rss$pip)
}
# Flip effects so allele match eQTL sumstats
susie_score<-merge(ref_tmp, susie_score, by='SNP', all=T)
susie_score$BETA[which(susie_score$A1.x != susie_score$A1.y)]<--susie_score$BETA[which(susie_score$A1.x != susie_score$A1.y)]
susie_score<-susie_score[,c('SNP','A1.x','A2','BETA'), with=F]
names(susie_score)<-c('SNP','A1','A2','BETA')
# Sort score file according eqtl_gene_i
susie_score<-susie_score[match(eqtl_gene_i$SNP, susie_score$SNP),]
# Create RDat file for FUSION
cv.performance<-as.matrix(data.frame(sbayesr=c(NA,NA),
sbayesr_robust=c(NA,NA),
dbslmm=c(NA,NA),
prscs=c(NA,NA),
susie=c(NA,NA),
top1=c(NA,NA),
row.names=c('rsq','pval')))
# Sort eQTL data into the same order as other score files
eqtl_gene_i<-eqtl_gene_i[match(sbayesr_score$SNP, eqtl_gene_i$SNP),]
hsq<-c(par_res_file_i$hsq, par_res_file_i$se)
hsq.pv<-par_res_file_i$p
N.tot<-max(eqtl_gene_i$N)
eqtl_gene_i$POS<-0
snps<-eqtl_gene_i[,c('CHR','SNP','POS','BP','A1','A2')]
names(snps)<-paste0('V',1:length(names(snps)))
wgt.matrix<-as.matrix(data.frame(sbayesr=sbayesr_score$BETA,
sbayesr_robust=sbayesr_robust_score$BETA,
dbslmm=dbslmm_score$BETA,
prscs=prscs_score$BETA,
susie=susie_score$BETA,
top1=eqtl_gene_i$slope,
row.names=snps$V2))
save(cv.performance,
hsq,
hsq.pv,
N.tot,
snps,
wgt.matrix,
file = paste0('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/snp_weight_comp/RDat_files/',gene_i,'.RDat'))
}
}
}
}
library(data.table)
fusion_pos<-fread('/scratch/groups/biomarkers-brc-mh/TWAS_resource/FUSION/SNP-weights/Whole_Blood/Whole_Blood.pos')
fusion_pos<-fusion_pos[fusion_pos$CHR == 22,]
dim(fusion_pos) # 234
eqtl_derived_files<-list.files(path='/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/snp_weight_comp/RDat_files/', pattern='.RDat')
fusion_files<-list.files(path='/scratch/groups/biomarkers-brc-mh/TWAS_resource/FUSION/SNP-weights/Whole_Blood/Whole_Blood/', pattern='.RDat')
length(eqtl_derived_files) # 159
length(fusion_files) # 6006
eqtl_derived_files<-gsub('.RDat','',eqtl_derived_files)
fusion_files<-gsub('.wgt.RDat','',gsub('Whole_Blood.','',fusion_files))
genes<-intersect(eqtl_derived_files, fusion_files)
length(genes) # 33
nsnp_res<-NULL
hsq_res<-NULL
cor_res<-list()
for(genes_i in genes){
load(paste0('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/snp_weight_comp/RDat_files/',genes_i,'.RDat'))
eqtl_derived<-data.frame(wgt.matrix)
if(sum(eqtl_derived$susie) == 0){
eqtl_derived$susie<-NA
}
eqtl_derived$SNP<-row.names(eqtl_derived)
sbayesr<-c(hsq, hsq.pv)
load(paste0('/scratch/groups/biomarkers-brc-mh/TWAS_resource/FUSION/SNP-weights/Whole_Blood/Whole_Blood/Whole_Blood.',genes_i,'.wgt.RDat'))
fusion<-data.frame(wgt.matrix)
fusion$SNP<-row.names(fusion)
greml<-c(hsq, hsq.pv)
nsnp_res<-rbind(nsnp_res, data.frame(gene_id=genes_i,
nsnp_eqtl_derived=nrow(eqtl_derived),
nsnp_fusion=nrow(fusion)))
hsq_res<-rbind(hsq_res, data.frame(gene_id=genes_i,
sbayesr_h2=sbayesr[1],
sbayesr_se=sbayesr[2],
sbayesr_p=sbayesr[3],
greml_h2=greml[1],
greml_se=greml[2],
greml_p=greml[3]))
both<-merge(eqtl_derived, fusion, by='SNP')
if(nrow(both) > 0){
cor_res[[genes_i]]<-cor(both[,-1], use='p')
}
}
# Compare h2
cor(hsq_res$sbayesr_h2, hsq_res$greml_h2) # 0.7500682
library(ggplot2)
png('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/snp_weight_comp/h2_comp.png', units='px', width=1000, height=1000, res=300)
ggplot(hsq_res, aes(x=sbayesr_h2, y=greml_h2)) +
geom_abline(intercept =0 , slope = 1, colour='red') +
geom_point() +
coord_fixed() +
xlim(0,1) +
ylim(0,1)
dev.off()
# Compare nsnp
cor(nsnp_res$nsnp_eqtl_derived, nsnp_res$nsnp_fusion) # 0.4849533
library(ggplot2)
png('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/snp_weight_comp/nsnp_comp.png', units='px', width=1000, height=1000, res=300)
ggplot(nsnp_res, aes(x=nsnp_eqtl_derived, y=nsnp_fusion)) +
geom_abline(intercept =0 , slope = 1, colour='red') +
geom_point() +
coord_fixed()
dev.off()
# Check SNP-weight correlations
cor_res_melt<-melt(cor_res)
cor_res_melt$Var1<-gsub('top1.x','top1.GTEx',cor_res_melt$Var1)
cor_res_melt$Var2<-gsub('top1.x','top1.GTEx',cor_res_melt$Var2)
cor_res_melt$Var1<-gsub('top1.y','top1.FUSION',cor_res_melt$Var1)
cor_res_melt$Var2<-gsub('top1.y','top1.FUSION',cor_res_melt$Var2)
cor_res_melt$test<-paste0(cor_res_melt$Var1,'_',cor_res_melt$Var2)
cor_res_average<-NULL
for(i in unique(cor_res_melt$test)){
cor_res_average<-rbind(cor_res_average, data.frame(test=i,
Mean=mean(cor_res_melt$value[cor_res_melt$test == i])))
}
png('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v7/snp_weight_comp/cor_plot.png', units='px', width=3000, height=3000, res=300)
ggplot(cor_res_melt, aes(x=paste0(Var1,' vs. ',Var2), y=value)) +
geom_boxplot() +
coord_flip()
dev.off()
Show GREML and SBayesR SNP-h2 estimates
Show number of variants in FUSION and eQTL-based models
Show correlation between FUSION and eQTL-based models
The correlation between eQTL derived TWAS weights and FUSION derived TWAS weights is similar to correlation between different FUSION models.
The correlation between eQTL derived and FUSION weights is of interest, but not a good evaluation metric of methods, We should compare the correlation between predicted and observed expression, when using eQTL derived TWAS weights or largest FUSION TWAS weights.
The low correlation between topSNP results from FUSION and GTEx indicates comparison of FUSION and eQTL susmtat derived weights are not very informative.
Here we will predict gene expression levels in the GTEx v8 sample and then test the correlation with observered expression levels. Initially, we will use FUSION SNP-weights derived using the YFS whole blood sample, and eQTL data from eQTLGen whole blood meta-analysis (excl GTEx).
I do not have access to individual level data from GTeX v8, so this project is being carried out in collaboration with Zac Gerring. Here, I will prepare the SNP-weights and the code to predict gene expression levels for Zac. As an example target sample, I will use the EUR subset of the 1KG Phase 3 reference.
I have already written a script to predicted expression levels from TWAS weights called FeaturePred.
Show code
# Download the full cis-eQTL data exluding GTEx
# This was sent privately by Urmo Vosa
mkdir /users/k1806347/brc_scratch/Data/eQTLGen/excl_GTEx
# Extract relevent columns
zcat /users/k1806347/brc_scratch/Data/eQTLGen/excl_GTEx/2019-12-11-cis-eQTLsFDR-ProbeLevel-CohortInfoRemoved-WoGTEx.txt.gz | cut -f 1-5,9-11,13 | gzip > /users/k1806347/brc_scratch/Data/eQTLGen/excl_GTEx/2019-12-11-cis-eQTLsFDR-ProbeLevel-CohortInfoRemoved-WoGTEx.small.txt.gz
library(data.table)
# Read in relevent columns from the sumstats
sumstats<-fread('/users/k1806347/brc_scratch/Data/eQTLGen/excl_GTEx/2019-12-11-cis-eQTLsFDR-ProbeLevel-CohortInfoRemoved-WoGTEx.small.txt.gz')
# Extract data for each gene
genes<-unique(sumstats$ProbeName)
# Read in EUR MAF
frq<-NULL
for(i in 1:22){
frq<-rbind(frq, fread(paste0('/mnt/lustre/groups/biomarkers-brc-mh/TWAS_resource/FUSION/LDREF/1000G.EUR.',i,'.frq')))
}
names(frq)[names(frq) == 'MAF']<-'FREQ'
setkey(sumstats, ProbeName)
# Process eQTL sumstats for each gene
# For testing use only first 100 genes
dir.create('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/eQTLGen')
for(i in 1:length(genes)){
print(i)
tmp<-sumstats[.(genes[i])]
# Create A1 and A2 columns and update column names
tmp$A1<-gsub('.*/','',tmp$SNPType)
tmp$A2<-gsub('/.*','',tmp$SNPType)
tmp$A2[tmp$A1 != tmp$AlleleAssessed]<-gsub('.*/','',tmp$SNPType[tmp$A1 != tmp$AlleleAssessed])
tmp$A1[tmp$A1 != tmp$AlleleAssessed]<-gsub('/.*','',tmp$SNPType[tmp$A1 != tmp$AlleleAssessed])
tmp<-tmp[,c('ProbeName','SNPName','SNPChr','SNPChrPos', 'A1', 'A2','PValue','OverallZScore','SumNumberOfSamples'), with=F]
names(tmp)<-c('Gene','SNP','CHR','BP','A1','A2','P','Z','N')
# Insert FREQ from EUR reference
# There don't seem to be any strand flips
tmp_match<-merge(tmp, frq[frq$CHR == tmp$CHR[1],c('SNP','A1','A2','FREQ'),with=F], by=c('SNP','A1','A2'))
tmp_flip<-merge(tmp, frq[frq$CHR == tmp$CHR[1],c('SNP','A1','A2','FREQ'),with=F], by.x=c('SNP','A1','A2'), by.y=c('SNP','A2','A1'))
tmp_flip$FREQ<-1-tmp_flip$FREQ
tmp<-rbind(tmp_match, tmp_flip)
# Approximate BETA, SE, and P
tmp$P<-2*pnorm(-abs(tmp$Z))
tmp$BETA<-tmp$Z/sqrt((2*tmp$FREQ)*(1-tmp$FREQ)*(tmp$N+sqrt(abs(tmp$Z))))
tmp$SE<-abs(tmp$BETA)/abs(tmp$Z)
tmp$GENE<-genes[i]
tmp<-tmp[,c('GENE','CHR','SNP','BP','A1','A2','BETA','SE','Z','P','FREQ','N'),with=F]
if(i == 1){
write.table(tmp, '/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/eQTLGen/eQTLGen_sumstats_exclGTeX.txt', col.names=T, row.names=F, quote=F)
} else {
write.table(tmp, '/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/eQTLGen/eQTLGen_sumstats_exclGTeX.txt', col.names=F, row.names=F, quote=F, append=T)
}
}
Show code
# Script modified to run only first 100 genes
sbatch -p brc,shared --mem 10G -n 1 /users/k1806347/brc_scratch/Software/Rscript.sh /users/k1806347/brc_scratch/Software/MyGit/eQTL_to_TWAS/compute_weights.R \
--extract /users/k1806347/brc_scratch/Data/ldsc/w_hm3.snplist \
--sumstats /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/eQTLGen/eQTLGen_sumstats_exclGTeX.txt \
--gctb /users/k1806347/brc_scratch/Software/gctb_2.03beta_Linux/gctb_203.sh \
--gctb_ref /scratch/groups/ukbiobank/usr/ollie_pain/GenoPredPipe/GenoPred/GenoPredPipe/resources/data/gctb_ref/ukbEURu_hm3_shrunk_sparse/ukbEURu_hm3_v3_50k_chr \
--plink_ref_chr /users/k1806347/brc_scratch/Data/1KG/Phase3/1KGPhase3.w_hm3.chr \
--ld_blocks /users/k1806347/brc_scratch/Data/LDetect/EUR \
--rscript /users/k1806347/brc_scratch/Software/Rscript.sh \
--dbslmm /users/k1806347/brc_scratch/Software/DBSLMM/software \
--plink /users/k1806347/brc_scratch/Software/plink1.9.sh \
--PRScs_path /users/k1806347/brc_scratch/Software/PRScs.sh \
--PRScs_ref_path /users/k1806347/brc_scratch/Software/PRScs/ldblk_1kg_eur \
--output /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/eQTLGen/twas_weights
# Rename RDat folder to match FUSION format weights
mv /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/eQTLGen/RDat_files /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/eQTLGen/eQTLGen.eQTL
# Create .pos file
library(data.table)
# Read in list of RDat files
rdat_list<-list.files(path='/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/eQTLGen/eQTLGen.eQTL', pattern='.RDat')
# Start making pos file
pos<-data.frame(PANEL='eQTLGen.eQTL',
WGT=paste0('eQTLGen.eQTL/',rdat_list),
ID=gsub('\\..*','',rdat_list))
# Insert CHR, P0 and P1 (GRCh=37)
library(biomaRt)
ensembl = useEnsembl(biomart="ensembl", dataset="hsapiens_gene_ensembl", GRCh=37)
Genes<-getBM(attributes=c('ensembl_gene_id_version','chromosome_name','start_position','end_position'), mart = ensembl)
Genes$ensembl_gene_id_version<-gsub('\\..*','',Genes$ensembl_gene_id_version)
names(Genes)<-c('ID','CHR','P0','P1')
Genes<-Genes[complete.cases(Genes),]
Genes<-Genes[!duplicated(Genes),]
pos<-merge(pos, Genes, all.x=T, by='ID')
# Read in each RDat file to retrieve N
for(i in 1:length(rdat_list)){
load(paste0('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/eQTLGen/eQTLGen.eQTL/',rdat_list[i]))
pos$N[i]<-N.tot
}
pos<-pos[,c('PANEL','WGT','ID','CHR','P0','P1','N')]
write.table(pos, '/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/eQTLGen/eQTLGen.eQTL.pos', col.names=T, row.names=F, quote=F)
Show code
# Run in parallel for first 100 genes
# Create file listing GWAS that haven't been processed.
cut -f 1 -d' ' /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/eQTLGen/eQTLGen_sumstats_exclGTeX.txt | tail -n +2 | uniq | head -n 100 > /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/eQTLGen2/gene_list.txt
> /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/eQTLGen2/todo.txt
for i in $(cat /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/eQTLGen2/gene_list.txt);do
if [ ! -f /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/eQTLGen2/eQTLGen.eQTL/${i}.done ]; then
echo ${i} >> /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/eQTLGen2/todo.txt
fi
done
# Create shell script to run using sbatch
cat > /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/eQTLGen2/sbatch.sh << 'EOF'
#!/bin/sh
#SBATCH -p shared,brc
#SBATCH --mem 10G
#SBATCH -n 1
#SBATCH --nodes 1
#SBATCH -J eQTL_to_TWAS
ID=$(awk -v var="$SLURM_ARRAY_TASK_ID" 'NR == var {print $1}' /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/eQTLGen2/todo.txt)
/users/k1806347/brc_scratch/Software/Rscript.sh /users/k1806347/brc_scratch/Software/MyGit/eQTL_to_TWAS/compute_weights_V2.R \
--id ${ID} \
--extract /users/k1806347/brc_scratch/Data/ldsc/w_hm3.snplist \
--sumstats /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/eQTLGen/eQTLGen_sumstats_exclGTeX.txt \
--gctb /users/k1806347/brc_scratch/Software/gctb_2.03beta_Linux/gctb_203.sh \
--gctb_ref /scratch/groups/ukbiobank/usr/ollie_pain/GenoPredPipe/GenoPred/GenoPredPipe/resources/data/gctb_ref/ukbEURu_hm3_shrunk_sparse/ukbEURu_hm3_v3_50k_chr \
--plink_ref_chr /users/k1806347/brc_scratch/Data/1KG/Phase3/1KGPhase3.w_hm3.chr \
--plink_ref_keep /users/k1806347/brc_scratch/Data/1KG/Phase3/keep_files/EUR_samples.keep \
--ld_blocks /users/k1806347/brc_scratch/Data/LDetect/EUR \
--rscript /users/k1806347/brc_scratch/Software/Rscript.sh \
--dbslmm /users/k1806347/brc_scratch/Software/DBSLMM/software \
--plink /users/k1806347/brc_scratch/Software/plink1.9.sh \
--PRScs_path /users/k1806347/brc_scratch/Software/PRScs.sh \
--PRScs_ref_path /users/k1806347/brc_scratch/Software/PRScs/ldblk_1kg_eur \
--ldpred2_ref_dir /scratch/groups/ukbiobank/usr/ollie_pain/GenoPredPipe/GenoPred/GenoPredPipe/resources/data/ldpred2_ref \
--output /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/eQTLGen2/eQTLGen.eQTL
EOF
sbatch --array 1-$(wc -l /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/eQTLGen2/todo.txt | cut -d' ' -f1) /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/eQTLGen2/sbatch.sh
# Create .pos file
library(data.table)
# Read in list of RDat files
rdat_list<-list.files(path='/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/eQTLGen2/eQTLGen.eQTL', pattern='.RDat')
# Start making pos file
pos<-data.frame(PANEL='eQTLGen.eQTL',
WGT=paste0('eQTLGen.eQTL/',rdat_list),
ID=gsub('\\..*','',rdat_list))
# Insert CHR, P0 and P1 (GRCh=37)
library(biomaRt)
ensembl = useEnsembl(biomart="ensembl", dataset="hsapiens_gene_ensembl", GRCh=37)
Genes<-getBM(attributes=c('ensembl_gene_id_version','chromosome_name','start_position','end_position'), mart = ensembl)
Genes$ensembl_gene_id_version<-gsub('\\..*','',Genes$ensembl_gene_id_version)
names(Genes)<-c('ID','CHR','P0','P1')
Genes<-Genes[complete.cases(Genes),]
Genes<-Genes[!duplicated(Genes),]
pos<-merge(pos, Genes, all.x=T, by='ID')
# Read in each RDat file to retrieve N
for(i in 1:length(rdat_list)){
load(paste0('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/eQTLGen2/eQTLGen.eQTL/',rdat_list[i]))
pos$N[i]<-N.tot
}
pos<-pos[,c('PANEL','WGT','ID','CHR','P0','P1','N')]
write.table(pos, '/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/eQTLGen2/eQTLGen.eQTL.pos', col.names=T, row.names=F, quote=F)
Show code
# Create file listing GWAS that haven't been processed.
mkdir /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/eQTLGenall
cut -f 1 -d' ' /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/eQTLGen/eQTLGen_sumstats_exclGTeX.txt | tail -n +2 | uniq > /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/eQTLGenall/gene_list.txt
> /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/eQTLGenall/todo.txt
for i in $(cat /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/eQTLGenall/gene_list.txt);do
if [ ! -f /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/eQTLGenall/eQTLGen.eQTL/${i}.done ]; then
echo ${i} >> /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/eQTLGenall/todo.txt
fi
done
# Create shell script to run using sbatch
cat > /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/eQTLGenall/sbatch.sh << 'EOF'
#!/bin/sh
#SBATCH -p shared,brc
#SBATCH --mem 10G
#SBATCH -n 1
#SBATCH --nodes 1
#SBATCH -J eQTL_to_TWAS
ID=$(awk -v var="$SLURM_ARRAY_TASK_ID" 'NR == var {print $1}' /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/eQTLGenall/todo.txt)
/users/k1806347/brc_scratch/Software/Rscript.sh /users/k1806347/brc_scratch/Software/MyGit/eQTL_to_TWAS/compute_weights_V2.R \
--id ${ID} \
--extract /users/k1806347/brc_scratch/Data/ldsc/w_hm3.snplist \
--sumstats /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/eQTLGen/eQTLGen_sumstats_exclGTeX.txt \
--gctb /users/k1806347/brc_scratch/Software/gctb_2.03beta_Linux/gctb_203.sh \
--gctb_ref /scratch/groups/ukbiobank/usr/ollie_pain/GenoPredPipe/GenoPred/GenoPredPipe/resources/data/gctb_ref/ukbEURu_hm3_shrunk_sparse/ukbEURu_hm3_v3_50k_chr \
--plink_ref_chr /users/k1806347/brc_scratch/Data/1KG/Phase3/1KGPhase3.w_hm3.chr \
--plink_ref_keep /users/k1806347/brc_scratch/Data/1KG/Phase3/keep_files/EUR_samples.keep \
--ld_blocks /users/k1806347/brc_scratch/Data/LDetect/EUR \
--rscript /users/k1806347/brc_scratch/Software/Rscript.sh \
--dbslmm /users/k1806347/brc_scratch/Software/DBSLMM/software \
--plink /users/k1806347/brc_scratch/Software/plink1.9.sh \
--PRScs_path /users/k1806347/brc_scratch/Software/PRScs.sh \
--PRScs_ref_path /users/k1806347/brc_scratch/Software/PRScs/ldblk_1kg_eur \
--ldpred2_ref_dir /scratch/groups/ukbiobank/usr/ollie_pain/GenoPredPipe/GenoPred/GenoPredPipe/resources/data/ldpred2_ref \
--output /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/eQTLGenall/eQTLGen.eQTL
EOF
sbatch --array 1-$(wc -l /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/eQTLGenall/todo.txt | cut -d' ' -f1)%200 /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/eQTLGenall/sbatch.sh
# Create .pos file
library(data.table)
# Read in list of RDat files
rdat_list<-list.files(path='/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/eQTLGenall/eQTLGen.eQTL', pattern='.RDat')
# Start making pos file
pos<-data.frame(PANEL='eQTLGen.eQTL',
WGT=paste0('eQTLGen.eQTL/',rdat_list),
ID=gsub('\\..*','',rdat_list))
# Insert CHR, P0 and P1 (GRCh=37)
library(biomaRt)
ensembl = useEnsembl(biomart="ensembl", dataset="hsapiens_gene_ensembl", GRCh=37)
Genes<-getBM(attributes=c('ensembl_gene_id_version','chromosome_name','start_position','end_position'), mart = ensembl)
Genes$ensembl_gene_id_version<-gsub('\\..*','',Genes$ensembl_gene_id_version)
names(Genes)<-c('ID','CHR','P0','P1')
Genes<-Genes[complete.cases(Genes),]
Genes<-Genes[!duplicated(Genes),]
pos<-merge(pos, Genes, all.x=T, by='ID')
# Read in each RDat file to retrieve N
for(i in 1:length(rdat_list)){
load(paste0('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/eQTLGenall/eQTLGen.eQTL/',rdat_list[i]))
pos$N[i]<-N.tot
}
pos<-pos[,c('PANEL','WGT','ID','CHR','P0','P1','N')]
write.table(pos, '/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/eQTLGenall/eQTLGen.eQTL.pos', col.names=T, row.names=F, quote=F)
FUSION only provide the N of the sample and the Z of each SNP for each gene. We can convert this to the required information using reference MAF, though using these approximations is not ideal. This may not be the best comparison since they are finish and there may therefore be MAF and LD mismatch with target sample. Though I assume FUSION developers restricted the analysis to those os EUR ancestry.
Show code
library(data.table)
# Read in the pos file
pos<-fread('/mnt/lustre/groups/biomarkers-brc-mh/TWAS_resource/FUSION/SNP-weights/YFS.BLOOD.RNAARR/YFS.BLOOD.RNAARR.pos')
# Read in EUR MAF
frq<-NULL
for(i in 1:22){
frq<-rbind(frq, fread(paste0('/mnt/lustre/groups/biomarkers-brc-mh/TWAS_resource/FUSION/LDREF/1000G.EUR.',i,'.frq')))
}
names(frq)[names(frq) == 'MAF']<-'FREQ'
for(i in 1:nrow(pos)){
print(i)
load(paste0('/mnt/lustre/groups/biomarkers-brc-mh/TWAS_resource/FUSION/SNP-weights/YFS.BLOOD.RNAARR/',pos$WGT[i]))
tmp<-data.table(cbind(snps, wgt.matrix[,which(colnames(wgt.matrix) == 'top1')]))
names(tmp)<-c('CHR','SNP','POS','BP','A1','A2','Z')
tmp$N<-pos$N[i]
# Insert EUR MAF
tmp_match<-merge(tmp, frq[,c('SNP','A1','A2','FREQ'),with=F], by=c('SNP','A1','A2'))
tmp_flip<-merge(tmp, frq[,c('SNP','A1','A2','FREQ'),with=F], by.x=c('SNP','A1','A2'), by.y=c('SNP','A2','A1'))
tmp_flip$FREQ<-1-tmp_flip$FREQ
tmp<-rbind(tmp_match, tmp_flip)
# Approximate BETA, SE, and P
tmp$P<-2*pnorm(-abs(tmp$Z))
tmp$BETA<-tmp$Z/sqrt((2*tmp$FREQ)*(1-tmp$FREQ)*(tmp$N+sqrt(abs(tmp$Z))))
tmp$SE<-abs(tmp$BETA)/abs(tmp$Z)
tmp$GENE<-pos$ID[i]
tmp<-tmp[,c('GENE','CHR','SNP','BP','A1','A2','BETA','SE','Z','P','FREQ','N'),with=F]
if(i == 1){
write.table(tmp, '/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/YFS/YFS_eQTL_from_fusion.txt', col.names=T, row.names=F, quote=F)
} else {
write.table(tmp, '/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/YFS/YFS_eQTL_from_fusion.txt', col.names=F, row.names=F, quote=F, append=T)
}
}
# Make this the input format for eQTL_to_TWAS script.
# Incorporate DENTIST QC of sumstats?
Show code
# Script modified to run only first 100 genes
sbatch -p brc,shared --mem 10G -n 1 /users/k1806347/brc_scratch/Software/Rscript.sh /users/k1806347/brc_scratch/Software/MyGit/eQTL_to_TWAS/compute_weights.R \
--extract /users/k1806347/brc_scratch/Data/ldsc/w_hm3.snplist \
--sumstats /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/YFS/YFS_eQTL_from_fusion.txt \
--gctb /users/k1806347/brc_scratch/Software/gctb_2.03beta_Linux/gctb_203.sh \
--gctb_ref /scratch/groups/ukbiobank/usr/ollie_pain/GenoPredPipe/GenoPred/GenoPredPipe/resources/data/gctb_ref/ukbEURu_hm3_shrunk_sparse/ukbEURu_hm3_v3_50k_chr \
--plink_ref_chr /users/k1806347/brc_scratch/Data/1KG/Phase3/1KGPhase3.w_hm3.chr \
--ld_blocks /users/k1806347/brc_scratch/Data/LDetect/EUR \
--rscript /users/k1806347/brc_scratch/Software/Rscript.sh \
--dbslmm /users/k1806347/brc_scratch/Software/DBSLMM/software \
--plink /users/k1806347/brc_scratch/Software/plink1.9.sh \
--PRScs_path /users/k1806347/brc_scratch/Software/PRScs.sh \
--PRScs_ref_path /users/k1806347/brc_scratch/Software/PRScs/ldblk_1kg_eur \
--output /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/YFS/twas_weights
# Rename RDat folder to match FUSION format weights
mv /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/YFS/RDat_files /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/YFS/YFS.eQTL
# Create .pos file
library(data.table)
pos<-fread('/mnt/lustre/groups/biomarkers-brc-mh/TWAS_resource/FUSION/SNP-weights/YFS.BLOOD.RNAARR/YFS.BLOOD.RNAARR.pos')
# Read in list of RDat files
rdat_list<-list.files(path='/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/YFS/YFS.eQTL', pattern='.RDat')
# Subset pos file to those with RDat file
pos<-pos[pos$ID %in% gsub('.RDat','',rdat_list),]
pos$PANEL<-'YFS.eQTL'
pos$WGT<-paste0(pos$ID,'.RDat')
write.table(pos, '/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/YFS/YFS.eQTL.pos', col.names=T, row.names=F, quote=F)
Show code
# Run in parallel for first 100 genes
# Create file listing GWAS that haven't been processed.
mkdir /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/YFS2
cut -f 1 -d' ' /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/YFS/YFS_eQTL_from_fusion.txt | tail -n +2 | uniq | head -n 100 > /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/YFS2/gene_list.txt
> /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/YFS2/todo.txt
for i in $(cat /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/YFS2/gene_list.txt);do
if [ ! -f /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/YFS2/YFS.eQTL/${i}.done ]; then
echo ${i} >> /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/YFS2/todo.txt
fi
done
# Create shell script to run using sbatch
cat > /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/YFS2/sbatch.sh << 'EOF'
#!/bin/sh
#SBATCH -p shared,brc
#SBATCH --mem 10G
#SBATCH -n 1
#SBATCH --nodes 1
#SBATCH -J eQTL_to_TWAS
ID=$(awk -v var="$SLURM_ARRAY_TASK_ID" 'NR == var {print $1}' /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/YFS2/todo.txt)
/users/k1806347/brc_scratch/Software/Rscript.sh /users/k1806347/brc_scratch/Software/MyGit/eQTL_to_TWAS/compute_weights_V2.R \
--id ${ID} \
--extract /users/k1806347/brc_scratch/Data/ldsc/w_hm3.snplist \
--sumstats /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/YFS/YFS_eQTL_from_fusion.txt \
--gctb /users/k1806347/brc_scratch/Software/gctb_2.03beta_Linux/gctb_203.sh \
--gctb_ref /scratch/groups/ukbiobank/usr/ollie_pain/GenoPredPipe/GenoPred/GenoPredPipe/resources/data/gctb_ref/ukbEURu_hm3_shrunk_sparse/ukbEURu_hm3_v3_50k_chr \
--plink_ref_chr /users/k1806347/brc_scratch/Data/1KG/Phase3/1KGPhase3.w_hm3.chr \
--plink_ref_keep /users/k1806347/brc_scratch/Data/1KG/Phase3/keep_files/EUR_samples.keep \
--ld_blocks /users/k1806347/brc_scratch/Data/LDetect/EUR \
--rscript /users/k1806347/brc_scratch/Software/Rscript.sh \
--dbslmm /users/k1806347/brc_scratch/Software/DBSLMM/software \
--plink /users/k1806347/brc_scratch/Software/plink1.9.sh \
--PRScs_path /users/k1806347/brc_scratch/Software/PRScs.sh \
--PRScs_ref_path /users/k1806347/brc_scratch/Software/PRScs/ldblk_1kg_eur \
--ldpred2_ref_dir /scratch/groups/ukbiobank/usr/ollie_pain/GenoPredPipe/GenoPred/GenoPredPipe/resources/data/ldpred2_ref \
--output /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/YFS2/YFS.eQTL
EOF
sbatch --array 1-$(wc -l /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/YFS2/todo.txt | cut -d' ' -f1) /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/YFS2/sbatch.sh
# Create .pos file
library(data.table)
# Read in list of RDat files
rdat_list<-list.files(path='/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/YFS2/YFS.eQTL', pattern='.RDat')
# Start making pos file
pos<-data.frame(PANEL='YFS.eQTL',
WGT=paste0('YFS.eQTL/',rdat_list),
ID=gsub('\\..*','',rdat_list))
# Insert CHR, P0 and P1 (GRCh=37)
library(biomaRt)
ensembl = useEnsembl(biomart="ensembl", dataset="hsapiens_gene_ensembl", GRCh=37)
Genes<-getBM(attributes=c('external_gene_name','chromosome_name','start_position','end_position'), mart = ensembl)
names(Genes)<-c('ID','CHR','P0','P1')
Genes<-Genes[complete.cases(Genes),]
Genes<-Genes[!duplicated(Genes),]
pos<-merge(pos, Genes, all.x=T, by='ID')
pos<-pos[order(pos$CHR),]
pos<-pos[!duplicated(pos$WGT),]
# Read in each RDat file to retrieve N
for(i in 1:length(rdat_list)){
load(paste0('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/YFS2/YFS.eQTL/',rdat_list[i]))
pos$N[i]<-N.tot
}
pos<-pos[,c('PANEL','WGT','ID','CHR','P0','P1','N')]
write.table(pos, '/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/YFS2/YFS.eQTL.pos', col.names=T, row.names=F, quote=F)
Show code
# Create file listing GWAS that haven't been processed.
mkdir /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/YFSall
cut -f 1 -d' ' /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/YFS/YFS_eQTL_from_fusion.txt | tail -n +2 | uniq > /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/YFSall/gene_list.txt
> /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/YFSall/todo.txt
for i in $(cat /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/YFSall/gene_list.txt);do
if [ ! -f /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/YFSall/YFS.eQTL/${i}.done ]; then
echo ${i} >> /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/YFSall/todo.txt
fi
done
# Create shell script to run using sbatch
cat > /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/YFSall/sbatch.sh << 'EOF'
#!/bin/sh
#SBATCH -p shared,brc
#SBATCH --mem 10G
#SBATCH -n 1
#SBATCH --nodes 1
#SBATCH -J eQTL_to_TWAS
ID=$(awk -v var="$SLURM_ARRAY_TASK_ID" 'NR == var {print $1}' /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/YFSall/todo.txt)
/users/k1806347/brc_scratch/Software/Rscript.sh /users/k1806347/brc_scratch/Software/MyGit/eQTL_to_TWAS/compute_weights_V2.R \
--id ${ID} \
--extract /users/k1806347/brc_scratch/Data/ldsc/w_hm3.snplist \
--sumstats /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/YFS/YFS_eQTL_from_fusion.txt \
--gctb /users/k1806347/brc_scratch/Software/gctb_2.03beta_Linux/gctb_203.sh \
--gctb_ref /scratch/groups/ukbiobank/usr/ollie_pain/GenoPredPipe/GenoPred/GenoPredPipe/resources/data/gctb_ref/ukbEURu_hm3_shrunk_sparse/ukbEURu_hm3_v3_50k_chr \
--plink_ref_chr /users/k1806347/brc_scratch/Data/1KG/Phase3/1KGPhase3.w_hm3.chr \
--plink_ref_keep /users/k1806347/brc_scratch/Data/1KG/Phase3/keep_files/EUR_samples.keep \
--ld_blocks /users/k1806347/brc_scratch/Data/LDetect/EUR \
--rscript /users/k1806347/brc_scratch/Software/Rscript.sh \
--dbslmm /users/k1806347/brc_scratch/Software/DBSLMM/software \
--plink /users/k1806347/brc_scratch/Software/plink1.9.sh \
--PRScs_path /users/k1806347/brc_scratch/Software/PRScs.sh \
--PRScs_ref_path /users/k1806347/brc_scratch/Software/PRScs/ldblk_1kg_eur \
--ldpred2_ref_dir /scratch/groups/ukbiobank/usr/ollie_pain/GenoPredPipe/GenoPred/GenoPredPipe/resources/data/ldpred2_ref \
--output /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/YFSall/YFS.eQTL
EOF
sbatch --array 1-$(wc -l /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/YFSall/todo.txt | cut -d' ' -f1)%50 /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/YFSall/sbatch.sh
# Create .pos file
library(data.table)
# Read in list of RDat files
rdat_list<-list.files(path='/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/YFSall/YFS.eQTL', pattern='.RDat')
# Start making pos file
pos<-data.frame(PANEL='YFS.eQTL',
WGT=paste0('YFS.eQTL/',rdat_list),
ID=gsub('\\..*','',rdat_list))
# Insert CHR, P0 and P1 (GRCh=37)
library(biomaRt)
ensembl = useEnsembl(biomart="ensembl", dataset="hsapiens_gene_ensembl", GRCh=37)
Genes<-getBM(attributes=c('external_gene_name','chromosome_name','start_position','end_position'), mart = ensembl)
names(Genes)<-c('ID','CHR','P0','P1')
Genes<-Genes[complete.cases(Genes),]
Genes<-Genes[!duplicated(Genes),]
pos<-merge(pos, Genes, all.x=T, by='ID')
pos<-pos[order(pos$CHR),]
pos<-pos[!duplicated(pos$WGT),]
# Read in each RDat file to retrieve N
for(i in 1:length(rdat_list)){
load(paste0('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/YFSall/YFS.eQTL/',rdat_list[i]))
pos$N[i]<-N.tot
}
pos<-pos[,c('PANEL','WGT','ID','CHR','P0','P1','N')]
write.table(pos, '/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/YFSall/YFS.eQTL.pos', col.names=T, row.names=F, quote=F)
Show code
sbatch -p brc,shared --mem 10G -n 1 /users/k1806347/brc_scratch/Software/Rscript.sh /users/k1806347/brc_scratch/Software/MyGit/Predicting-TWAS-features/FeaturePred.V2.0.R \
--PLINK_prefix_chr /mnt/lustre/groups/biomarkers-brc-mh/TWAS_resource/FUSION/LDREF/1000G.EUR. \
--weights /mnt/lustre/groups/biomarkers-brc-mh/TWAS_resource/FUSION/SNP-weights/YFS.BLOOD.RNAARR/YFS.BLOOD.RNAARR.pos \
--weights_dir /mnt/lustre/groups/biomarkers-brc-mh/TWAS_resource/FUSION/SNP-weights/YFS.BLOOD.RNAARR \
--ref_ld_chr /mnt/lustre/groups/biomarkers-brc-mh/TWAS_resource/FUSION/LDREF/1000G.EUR. \
--plink /users/k1806347/brc_scratch/Software/plink1.9.sh \
--n_cores 1 \
--save_score T \
--save_ref_expr T \
--memory 10000 \
--all_mod T \
--pigz /users/k1806347/brc_scratch/Software/pigz \
--ref_maf /mnt/lustre/groups/biomarkers-brc-mh/TWAS_resource/FUSION/LDREF/1000G.EUR. \
--output /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/1kg_pred_exp/YFS.BLOOD.RNAARR
Show code
sbatch -p brc,shared --mem 10G -n 1 /users/k1806347/brc_scratch/Software/Rscript.sh /users/k1806347/brc_scratch/Software/MyGit/Predicting-TWAS-features/FeaturePred.V2.0.R \
--PLINK_prefix_chr /mnt/lustre/groups/biomarkers-brc-mh/TWAS_resource/FUSION/LDREF/1000G.EUR. \
--weights /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/YFS/YFS.eQTL.pos \
--weights_dir /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/YFS \
--ref_ld_chr /mnt/lustre/groups/biomarkers-brc-mh/TWAS_resource/FUSION/LDREF/1000G.EUR. \
--plink /users/k1806347/brc_scratch/Software/plink1.9.sh \
--n_cores 1 \
--save_score T \
--save_ref_expr T \
--memory 10000 \
--all_mod T \
--pigz /users/k1806347/brc_scratch/Software/pigz \
--ref_maf /mnt/lustre/groups/biomarkers-brc-mh/TWAS_resource/FUSION/LDREF/1000G.EUR. \
--output /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/1kg_pred_exp/YFS.eQTL
Show code
sbatch -p brc,shared --mem 10G -n 1 /users/k1806347/brc_scratch/Software/Rscript.sh /users/k1806347/brc_scratch/Software/MyGit/Predicting-TWAS-features/FeaturePred.V2.0.R \
--PLINK_prefix_chr /mnt/lustre/groups/biomarkers-brc-mh/TWAS_resource/FUSION/LDREF/1000G.EUR. \
--weights /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/YFS2/YFS.eQTL.pos \
--weights_dir /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/YFS2 \
--ref_ld_chr /mnt/lustre/groups/biomarkers-brc-mh/TWAS_resource/FUSION/LDREF/1000G.EUR. \
--plink /users/k1806347/brc_scratch/Software/plink1.9.sh \
--n_cores 1 \
--save_score F \
--save_ref_expr F \
--memory 10000 \
--all_mod T \
--pigz /users/k1806347/brc_scratch/Software/pigz \
--ref_maf /mnt/lustre/groups/biomarkers-brc-mh/TWAS_resource/FUSION/LDREF/1000G.EUR. \
--output /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/1kg_pred_exp/YFS2.eQTL
Show code
sbatch -p brc,shared --mem 30G -n 5 /users/k1806347/brc_scratch/Software/Rscript.sh /users/k1806347/brc_scratch/Software/MyGit/Predicting-TWAS-features/FeaturePred.V2.0.R \
--PLINK_prefix_chr /mnt/lustre/groups/biomarkers-brc-mh/TWAS_resource/FUSION/LDREF/1000G.EUR. \
--weights /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/YFSall/YFS.eQTL.pos \
--weights_dir /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/YFSall \
--ref_ld_chr /mnt/lustre/groups/biomarkers-brc-mh/TWAS_resource/FUSION/LDREF/1000G.EUR. \
--plink /users/k1806347/brc_scratch/Software/plink1.9.sh \
--n_cores 5 \
--save_score F \
--save_ref_expr F \
--memory 10000 \
--all_mod T \
--pigz /users/k1806347/brc_scratch/Software/pigz \
--ref_maf /mnt/lustre/groups/biomarkers-brc-mh/TWAS_resource/FUSION/LDREF/1000G.EUR. \
--output /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/1kg_pred_exp/YFSall.eQTL
Show code
sbatch -p brc,shared --mem 10G -n 1 /users/k1806347/brc_scratch/Software/Rscript.sh /users/k1806347/brc_scratch/Software/MyGit/Predicting-TWAS-features/FeaturePred.V2.0.R \
--PLINK_prefix_chr /mnt/lustre/groups/biomarkers-brc-mh/TWAS_resource/FUSION/LDREF/1000G.EUR. \
--weights /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/eQTLGen/eQTLGen.eQTL.pos \
--weights_dir /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/eQTLGen \
--ref_ld_chr /mnt/lustre/groups/biomarkers-brc-mh/TWAS_resource/FUSION/LDREF/1000G.EUR. \
--plink /users/k1806347/brc_scratch/Software/plink1.9.sh \
--n_cores 1 \
--save_score T \
--save_ref_expr T \
--memory 10000 \
--all_mod T \
--pigz /users/k1806347/brc_scratch/Software/pigz \
--ref_maf /mnt/lustre/groups/biomarkers-brc-mh/TWAS_resource/FUSION/LDREF/1000G.EUR. \
--output /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/1kg_pred_exp/eQTLGen.eQTL
Show code
sbatch -p brc,shared --mem 10G -n 1 /users/k1806347/brc_scratch/Software/Rscript.sh /users/k1806347/brc_scratch/Software/MyGit/Predicting-TWAS-features/FeaturePred.V2.0.R \
--PLINK_prefix_chr /mnt/lustre/groups/biomarkers-brc-mh/TWAS_resource/FUSION/LDREF/1000G.EUR. \
--weights /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/eQTLGen2/eQTLGen.eQTL.pos \
--weights_dir /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/eQTLGen2 \
--ref_ld_chr /mnt/lustre/groups/biomarkers-brc-mh/TWAS_resource/FUSION/LDREF/1000G.EUR. \
--plink /users/k1806347/brc_scratch/Software/plink1.9.sh \
--n_cores 1 \
--save_score T \
--save_ref_expr T \
--memory 10000 \
--all_mod T \
--pigz /users/k1806347/brc_scratch/Software/pigz \
--ref_maf /mnt/lustre/groups/biomarkers-brc-mh/TWAS_resource/FUSION/LDREF/1000G.EUR. \
--output /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/1kg_pred_exp/eQTLGen2.eQTL
Show code
sbatch -p brc,shared --mem 30G -n 5 /users/k1806347/brc_scratch/Software/Rscript.sh /users/k1806347/brc_scratch/Software/MyGit/Predicting-TWAS-features/FeaturePred.V2.0.R \
--PLINK_prefix_chr /mnt/lustre/groups/biomarkers-brc-mh/TWAS_resource/FUSION/LDREF/1000G.EUR. \
--weights /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/eQTLGenall/eQTLGen.eQTL.pos \
--weights_dir /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/eQTLGenall \
--ref_ld_chr /mnt/lustre/groups/biomarkers-brc-mh/TWAS_resource/FUSION/LDREF/1000G.EUR. \
--plink /users/k1806347/brc_scratch/Software/plink1.9.sh \
--n_cores 5 \
--save_score T \
--save_ref_expr T \
--memory 10000 \
--all_mod T \
--pigz /users/k1806347/brc_scratch/Software/pigz \
--ref_maf /mnt/lustre/groups/biomarkers-brc-mh/TWAS_resource/FUSION/LDREF/1000G.EUR. \
--output /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/1kg_pred_exp/eQTLGenall.eQTL
Show code
library(data.table)
fusion<-fread('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/1kg_pred_exp/YFS.BLOOD.RNAARR/FeaturePredictions_YFS.BLOOD.RNAARR.txt.gz')
eqtl<-fread('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/1kg_pred_exp/YFS.eQTL/FeaturePredictions_YFS.eQTL.txt.gz')
genes<-unique(gsub('\\..*','',gsub('YFS.eQTL.','',names(eqtl)[-1:-2])))
fusion<-fusion[,grepl(paste0('FID|IID|',paste(genes,collapse='|')),names(fusion)), with=F]
both<-merge(fusion, eqtl, by=c('FID','IID'))
cor_res<-list()
for(i in genes){
tmp<-both[,grepl(paste0('\\.',i,'\\.'), names(both)), with=F]
names(tmp)<-gsub('YFS.eQTL.RDat.','eqtl.',gsub('YFS.BLOOD.RNAARR.YFS.','twas.',gsub(paste0(i,'.'),'', names(tmp))))
cor_res[[i]]<-cor(tmp, use='p')
}
# Check SNP-weight correlations
cor_res_melt<-melt(cor_res)
cor_res_melt$test<-paste0(cor_res_melt$Var1,'_',cor_res_melt$Var2)
cor_res_average<-NULL
for(i in unique(cor_res_melt$test)){
cor_res_average<-rbind(cor_res_average, data.frame(test=i,
Mean=mean(cor_res_melt$value[cor_res_melt$test == i])))
}
library(ggplot2)
png('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/1kg_pred_exp/YFS_cor_plot.png', units='px', width=3000, height=3000, res=300)
ggplot(cor_res_melt, aes(x=paste0(Var1,' vs. ',Var2), y=value)) +
geom_boxplot() +
coord_flip()
dev.off()
# Looks as expected, except eQTL.top1 model doesn't correlate well with fusion.top1 model. In many instances the correlation is 1, but there are many low correlation. This seems to be due to changing the top1 Z score in fusion files to a BETA estimated based on N and MAF. In some instances this changes the top SNP to a SNP that is uncorrelated with the top SNP in the fusion files. Perhaps I should select the top SNP based on the p-value, rather than the BETA. I think selecting based on p-value may be more concordant with standard pT+clump PRS methodology, and will avoid SNPs with small N and large SE being selected. Thougb since we are filtering by N, this shouldn;t be a problem, and usin the SNP with the largest BETA may be more appropriate. However, the correlation between top1 and other models is lower when using the largest abs(BETA), suggesting using the most significant SNP may be more appropriate. This should be discussed with Sasha.
load('/mnt/lustre/groups/biomarkers-brc-mh/TWAS_resource/FUSION/SNP-weights/YFS.BLOOD.RNAARR/YFS.BLOOD.RNAARR/YFS.TRNAU1AP.wgt.RDat')
fusion_wgt<-data.frame(wgt.matrix)
load('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/YFS/YFS.eQTL/TRNAU1AP.RDat')
eqtl_wgt<-data.frame(wgt.matrix)
head(fusion_wgt[rev(order(abs(fusion_wgt$top1))),])
head(eqtl_wgt[rev(order(abs(eqtl_wgt$top1))),])
Show correlation between SNP-weights from each method
Looks as expected, except eQTL.top1 model doesn’t correlate well with fusion.top1 model. In many instances the correlation is 1, but there are many low correlation. This seems to be due to changing the top1 Z score in fusion files to a BETA estimated based on N and MAF. In some instances this changes the top SNP to a SNP that is uncorrelated with the top SNP in the fusion files. Perhaps I should select the top SNP based on the p-value, rather than the BETA. I think selecting based on p-value may be more concordant with standard pT+clump PRS methodology, and will avoid SNPs with small N and large SE being selected. Though since we are filtering by N, this shouldn’t be a problem, and using the SNP with the largest BETA may be more appropriate. However, the correlation between top1 and other models is lower when using the largest abs(BETA), suggesting using the most significant SNP may be more appropriate.
Show code
library(data.table)
fusion<-fread('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/1kg_pred_exp/YFS.BLOOD.RNAARR/FeaturePredictions_YFS.BLOOD.RNAARR.txt.gz')
eqtl<-fread('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/1kg_pred_exp/YFS2.eQTL/FeaturePredictions_YFS.eQTL.txt.gz')
genes<-unique(gsub('\\..*','',gsub('YFS.eQTL.','',names(eqtl)[-1:-2])))
fusion<-fusion[,grepl(paste0('FID|IID|',paste(genes,collapse='|')),names(fusion)), with=F]
both<-merge(fusion, eqtl, by=c('FID','IID'))
cor_res<-list()
for(i in genes){
tmp<-both[,grepl(paste0('\\.',i,'\\.'), names(both)), with=F]
names(tmp)<-gsub('YFS.eQTL.RDat.','eqtl.',gsub('YFS.BLOOD.RNAARR.YFS.','twas.',gsub(paste0(i,'.'),'', names(tmp))))
cor_res[[i]]<-cor(tmp, use='p')
}
# Check SNP-weight correlations
cor_res_melt<-melt(cor_res)
cor_res_melt$test<-paste0(cor_res_melt$Var1,'_',cor_res_melt$Var2)
cor_res_average<-NULL
for(i in unique(cor_res_melt$test)){
cor_res_average<-rbind(cor_res_average, data.frame(test=i,
Mean=mean(cor_res_melt$value[cor_res_melt$test == i])))
}
library(ggplot2)
png('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/1kg_pred_exp/YFS2_cor_plot.png', units='px', width=3000, height=4000, res=300)
ggplot(cor_res_melt, aes(x=paste0(Var1,' vs. ',Var2), y=value)) +
geom_boxplot() +
coord_flip()
dev.off()
Show correlation between SNP-weights from each method
This is looking good. Everything is working as it should.
This will be run by Zac. I will send him the score files and reference expression, and fusion LD reference data. I don’t know the file paths so I have just put file names.
Show code
mkdir -p /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/for_zac
cd /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/for_zac
# Copy FUSION reference with allele frequency files
cp -r /mnt/lustre/groups/biomarkers-brc-mh/TWAS_resource/FUSION/LDREF ./
# Copy FUSION .pos, .Rdat files, score files, and reference expression
cd /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/for_zac
mkdir FUSION.YFS
cd FUSION.YFS
cp /scratch/groups/biomarkers-brc-mh/TWAS_resource/FUSION/SNP-weights/YFS.BLOOD.RNAARR/YFS.BLOOD.RNAARR.pos ./
cp -r /scratch/groups/biomarkers-brc-mh/TWAS_resource/FUSION/SNP-weights/YFS.BLOOD.RNAARR/YFS.BLOOD.RNAARR ./
# Copy eQTL-based .pos, .Rdat files, score files, and reference expression
cd /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/for_zac
mkdir eQTL.YFS
cd eQTL.YFS
cp /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/YFS/YFS.eQTL.pos ./
cp -r /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/YFS/YFS.eQTL ./
# Copy pigz binary
cd /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/for_zac
cp /users/k1806347/brc_scratch/Software/pigz ./
# Copy plink binary
cd /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/for_zac
cp /users/k1806347/brc_scratch/Software/plink1.9/plink ./
# Copy FeaturePred.V2.0.R
cd /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/for_zac
cp /users/k1806347/brc_scratch/Software/MyGit/Predicting-TWAS-features/FeaturePred.V2.0.R ./
cd /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS
tar -zcvf for_zac.tar.gz for_zac
##########
# Make a new folder containing updated YFS and eQTLGen TWAS weights
##########
mkdir -p /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/for_zac_26042022
# Copy YFS eQTL-based .pos, .Rdat files, score files, and reference expression
cd /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/for_zac_26042022
mkdir eQTL.YFS
cd eQTL.YFS
cp /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/YFS/YFS.eQTL.pos ./
cp -r /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/YFS/YFS.eQTL ./
# Copy eQTL-Gen eQTL-based .pos, .Rdat files, score files, and reference expression
cd /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/for_zac_26042022
mkdir eQTL.eQTLGen
cd eQTL.eQTLGen
cp /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/eQTLGen/eQTLGen.eQTL.pos ./
cp -r /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/eQTLGen/eQTLGen.eQTL ./
cd /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS
tar -zcvf for_zac_26042022.tar.gz for_zac_26042022
Show code
mkdir -p /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/for_zac_160622
cd /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/for_zac_160622
# Copy eQTL-based .pos, .Rdat files, score files, and reference expression
cd /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/for_zac_160622
mkdir eQTL.YFS
cd eQTL.YFS
cp /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/YFSall/YFS.eQTL.pos ./
cp -r /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/YFSall/YFS.eQTL ./
rm YFS.eQTL/*.done
# Copy eQTL-Gen eQTL-based .pos, .Rdat files, score files, and reference expression
cd /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/for_zac_160622
mkdir eQTL.eQTLGen
cd eQTL.eQTLGen
cp /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/eQTLGenall/eQTLGen.eQTL.pos ./
cp -r /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/eQTLGenall/eQTLGen.eQTL ./
rm eQTLGen.eQTL/*.done
cd /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS
tar -zcvf for_zac_160622.tar.gz for_zac_160622
Show code
cd /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/for_zac
Rscript FeaturePred.V2.0.R \
--PLINK_prefix_chr LDREF/1000G.EUR. \
--weights FUSION.YFS/YFS.BLOOD.RNAARR.pos \
--weights_dir FUSION.YFS \
--ref_ld_chr LDREF/1000G.EUR. \
--plink ./plink \
--n_cores 1 \
--memory 10000 \
--all_mod T \
--pigz ./pigz \
--ref_maf LDREF/1000G.EUR. \
--output test/FUSION.YFS
Show code
cd /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/for_zac
/users/k1806347/brc_scratch/Software/Rscript.sh FeaturePred.V2.0.nodel2.R \
--PLINK_prefix_chr LDREF/1000G.EUR. \
--weights eQTL.YFS/YFS.eQTL.pos \
--weights_dir eQTL.YFS \
--ref_ld_chr LDREF/1000G.EUR. \
--plink ./plink \
--n_cores 1 \
--memory 10000 \
--all_mod T \
--pigz ./pigz \
--ref_maf LDREF/1000G.EUR. \
--output test/eQTL.YFS
Zac now sends me the predicted expression data for GTEx v8 so I can compare predicted and observed expression. Observed expression in GTEx is publicly available.
Here I am reading in observed and predicted expression values, testing their correlation, and then summarising the results across gene expression imputation methods. I am using the processed and normalised observed expression in GTEx. First I covary observed expression for covariates used in the original eQTL analysis by GTEx.
Show code
library(data.table)
# Read in the observed expression
obs<-fread('/users/k1806347/brc_scratch/Data/GTeX/v8/GTEx_Analysis_v8_eQTL_expression_matrices/Whole_Blood.v8.normalized_expression.bed.gz')
# Insert external_gene_name
obs<-obs[,-1:-3]
obs$gene_id<-gsub('\\..*','',obs$gene_id)
library(biomaRt)
ensembl = useEnsembl(biomart="ensembl", dataset="hsapiens_gene_ensembl", GRCh=37)
Genes<-getBM(attributes=c('ensembl_gene_id_version','external_gene_name'), mart = ensembl)
Genes$ensembl_gene_id_version<-gsub('\\..*','',Genes$ensembl_gene_id_version)
obs<-merge(obs, Genes, by.x='gene_id', by.y='ensembl_gene_id_version')
obs<-obs[,c('external_gene_name',names(obs)[grepl('GTEX-', names(obs))]), with=F]
obs<-t(obs)
colnames(obs)<-obs[1,]
obs<-obs[-1,]
obs<-cbind(row.names(obs),obs)
colnames(obs)[1]<-'ID'
obs<-data.table(obs)
obs<-cbind(obs[,1],data.frame(lapply(obs[,-1], function(x) as.numeric(as.character(x)))))
# Read in covariates
covs<-fread('/users/k1806347/brc_scratch/Data/GTeX/v8/GTEx_Analysis_v8_eQTL_covariates/Whole_Blood.v8.covariates.txt')
covs<-t(covs)
colnames(covs)<-covs[1,]
covs<-covs[-1,]
covs<-cbind(row.names(covs),covs)
colnames(covs)[1]<-'ID'
covs<-data.table(covs)
covs<-cbind(covs[,1],data.frame(lapply(covs[,-1], function(x) as.numeric(as.character(x)))))
# Merge observed expression and covariates
obs<-obs[,!duplicated(names(obs)),with=F]
obs_covs<-merge(obs, covs, by='ID')
# Read in predicted expression
eqtl<-fread('/mnt/lustre/users/k1806347/Data/GTeX/v8/Zac/05072022/FeaturePredictions_YFS.eQTL.txt.gz')
names(eqtl)<-gsub('.RDat','',gsub('YFS.','', names(eqtl)))
fusion<-fread('/mnt/lustre/users/k1806347/Data/GTeX/v8/Zac/FUSION.YFS/FeaturePredictions_YFS.BLOOD.RNAARR.txt.gz')
names(fusion)<-gsub('YFS.BLOOD.RNAARR.YFS.','fusion.', names(fusion))
# Identify genes available in fusion and eqtl based models
obs_genes<-names(obs)[-1]
eqtl_genes<-unique(gsub('\\..*','',gsub('eQTL.','',names(eqtl)[-1:-2])))
fusion_genes<-unique(gsub('\\..*','',gsub('fusion.','',names(fusion)[-1:-2])))
both_genes<-intersect(obs_genes, eqtl_genes)
both_genes<-intersect(both_genes, fusion_genes)
# Residualise the covariates
obs_resid<-data.frame(ID=obs_covs$ID, stringsAsFactors=F)
for(i in both_genes){
obs_resid[[i]]<-as.numeric(scale(resid(lm(as.formula(paste0(i,' ~ ', paste(names(covs)[-1], collapse=' + '))), data=obs_covs))))
}
obs_resid<-data.table(obs_resid)
# Merge eqtl and fusion predicted expression
pred_exp<-merge(eqtl, fusion, by=c('FID','IID'))
# Calculate correlation between observed expression and predicted expression from each method
cor_res<-list()
for(i in both_genes){
# Rename columns to make output consistent across genes and label the observed expression
pred_exp_tmp<-pred_exp[,grepl(paste0('FID$|IID$|\\.',i,'\\.'), names(pred_exp)), with=F]
names(pred_exp_tmp)<-gsub(paste0('.',i), '', names(pred_exp_tmp))
obs_exp_tmp<-obs_resid[,c('ID',i), with=F]
names(obs_exp_tmp)[names(obs_exp_tmp) == i]<-'obs'
# merge predicted and observed expression
both_exp_tmp<-merge(pred_exp_tmp, obs_exp_tmp, by.x='IID', by.y='ID')
both_exp_tmp$obs<-as.numeric(both_exp_tmp$obs)
# Calculate correlation
cor_res[[i]]<-cor(both_exp_tmp[,-1:-2,with=F], use='p')
}
# melt and combine all the results
cor_res_melt<-melt(cor_res)
cor_res_melt_obs<-cor_res_melt[cor_res_melt$Var1 == 'obs',]
cor_res_melt_obs<-cor_res_melt_obs[cor_res_melt_obs$Var1 != cor_res_melt_obs$Var2,]
cor_res_melt_obs$test<-paste0(cor_res_melt_obs$Var1,'_',cor_res_melt_obs$Var2)
write.table(cor_res_melt_obs, '/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v8_pred_exp/pred_obs_cor_YFS.txt', row.names=F, quote=F)
# cor_res_melt_obs<-fread('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v8_pred_exp/pred_obs_cor_YFS.txt')
cor_res_average<-NULL
for(i in unique(cor_res_melt_obs$test)){
cor_res_average<-rbind(cor_res_average, data.frame(test=i,
Mean=mean(cor_res_melt_obs$value[cor_res_melt_obs$test == i])))
}
library(ggplot2)
png('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v8_pred_exp/pred_obs_cor_boxplot.png', units='px', width=1500, height=1000, res=300)
ggplot(cor_res_melt_obs, aes(x=paste0(Var1,' vs. ',Var2), y=value)) +
geom_boxplot() +
labs(x='Test', y='Correlation') +
coord_flip()
dev.off()
# Make a pairs plot
library("ggplot2")
library("GGally")
cor_res_melt_obs_unmelt<-dcast(data = cor_res_melt_obs,formula = L1~Var2,fun.aggregate = sum,value.var = "value")
cor_res_melt_obs_unmelt$negative
lowerfun <- function(data,mapping){
ggplot(data = data, mapping = mapping)+
geom_point() +
geom_abline(intercept =0 , slope = 1, colour='red') +
geom_vline(xintercept= 0, colour='blue') +
geom_hline(yintercept= 0, colour='blue') +
xlim(c(min(cor_res_melt_obs$value),max(cor_res_melt_obs$value)))+
ylim(c(min(cor_res_melt_obs$value),max(cor_res_melt_obs$value)))
}
diagfun <- function(data,mapping){
ggplot(data = data, mapping = mapping)+
geom_density() +
geom_vline(xintercept= 0, colour='blue') +
xlim(c(min(cor_res_melt_obs$value),max(cor_res_melt_obs$value)))
}
png('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v8_pred_exp/pred_obs_cor_pairsplot.png', units='px', width=5000, height=5000, res=300)
ggpairs(cor_res_melt_obs_unmelt[,-1], lower = list(continuous = wrap(lowerfun)), diag = list(continuous = wrap(diagfun)))
dev.off()
# Count the number of genes with correlation > 0.1 for each method
n_valid<-NULL
for(i in unique(cor_res_melt_obs$Var2)){
n_valid<-rbind(n_valid, data.frame(Method=i,
N_valid=sum(cor_res_melt_obs$value[cor_res_melt_obs$Var2 == i] > 0.1, na.rm=T),
median_cor=median(cor_res_melt_obs$value[cor_res_melt_obs$Var2 == i], na.rm=T)))
}
n_valid<-n_valid[order(-n_valid$N_valid),]
n_valid$Freq_valid<-n_valid$N_valid/length(unique(cor_res_melt_obs$L1))
# Also count the number of times each method performed best
cor_res_melt_obs_top<-cor_res_melt_obs[order(-cor_res_melt_obs$value),]
cor_res_melt_obs_top<-cor_res_melt_obs_top[!is.na(cor_res_melt_obs_top$value),]
cor_res_melt_obs_top<-cor_res_melt_obs_top[!duplicated(cor_res_melt_obs_top$L1),]
n_valid$N_top<-NA
n_valid$median_top<-NA
for(i in n_valid$Method){
n_valid$N_top[n_valid$Method == i]<-sum(cor_res_melt_obs_top$test == paste0('obs_',i))
n_valid$median_top[n_valid$Method == i]<-median(cor_res_melt_obs_top$value[cor_res_melt_obs_top$test == paste0('obs_',i)])
}
n_valid$Prop_top<-n_valid$N_top/sum(n_valid$N_top)
# Also count the number of times each sumstat method performed best
cor_res_melt_obs_ss<-cor_res_melt_obs[grepl('eQTL', cor_res_melt_obs$test),]
cor_res_melt_obs_ss<-cor_res_melt_obs_ss[!is.na(cor_res_melt_obs_ss$value),]
cor_res_melt_obs_top<-cor_res_melt_obs_ss[order(-cor_res_melt_obs_ss$value),]
cor_res_melt_obs_top<-cor_res_melt_obs_top[!duplicated(cor_res_melt_obs_top$L1),]
n_valid$N_top_ss<-NA
n_valid$median_top_ss<-NA
for(i in n_valid$Method){
n_valid$N_top_ss[n_valid$Method == i]<-sum(cor_res_melt_obs_top$test == paste0('obs_',i))
n_valid$median_top_ss[n_valid$Method == i]<-median(cor_res_melt_obs_top$value[cor_res_melt_obs_top$test == paste0('obs_',i)])
}
n_valid$Prop_top_ss<-n_valid$N_top_ss/sum(n_valid$N_top_ss)
# Also count the number of times each sumstat method performed best
cor_res_melt_obs_top1<-cor_res_melt_obs[grepl('top1', cor_res_melt_obs$test),]
cor_res_melt_obs_top1<-cor_res_melt_obs_top1[!is.na(cor_res_melt_obs_top1$value),]
cor_res_melt_obs_top<-cor_res_melt_obs_top1[order(-cor_res_melt_obs_top1$value),]
cor_res_melt_obs_top<-cor_res_melt_obs_top[!duplicated(cor_res_melt_obs_top$L1),]
n_valid$N_top_top1<-NA
n_valid$median_top_top1<-NA
for(i in n_valid$Method){
n_valid$N_top_top1[n_valid$Method == i]<-sum(cor_res_melt_obs_top$test == paste0('obs_',i))
n_valid$median_top_top1[n_valid$Method == i]<-median(cor_res_melt_obs_top$value[cor_res_melt_obs_top$test == paste0('obs_',i)])
}
n_valid$Prop_top_top1<-n_valid$N_top_top1/sum(n_valid$N_top_top1)
write.csv(n_valid, '/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v8_pred_exp/n_valid.csv', row.names=F)
# Read in reported R2 by weights
yfs.profile<-fread('/mnt/lustre/groups/biomarkers-brc-mh/TWAS_resource/FUSION/SNP-weights/YFS.BLOOD.RNAARR/YFS.BLOOD.RNAARR.profile')
Show correlation between observed and predicted expression
| Method | N_valid | median_cor | Freq_valid | N_top | median_top | Prop_top | N_top_ss | median_top_ss | Prop_top_ss | N_top_top1 | median_top_top1 | Prop_top_top1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| fusion.lasso | 882 | 0.0478431 | 0.2206655 | 353 | 0.0908498 | 0.0884047 | 0 | NA | 0.0000000 | 0 | NA | 0.0000000 |
| fusion.enet | 876 | 0.0483409 | 0.2191644 | 343 | 0.0862650 | 0.0859003 | 0 | NA | 0.0000000 | 0 | NA | 0.0000000 |
| fusion.bslmm | 874 | 0.0485076 | 0.2186640 | 317 | 0.0959522 | 0.0793889 | 0 | NA | 0.0000000 | 0 | NA | 0.0000000 |
| eQTL.prscs | 790 | 0.0473161 | 0.1976482 | 220 | 0.0901410 | 0.0550964 | 652 | 0.0861865 | 0.1632858 | 0 | NA | 0.0000000 |
| fusion.top1 | 774 | 0.0414693 | 0.1936452 | 440 | 0.0713319 | 0.1101928 | 0 | NA | 0.0000000 | 1801 | 0.0582154 | 0.4527401 |
| fusion.blup | 762 | 0.0450951 | 0.1906430 | 394 | 0.0670281 | 0.0986727 | 0 | NA | 0.0000000 | 0 | NA | 0.0000000 |
| eQTL.lassosum | 698 | 0.0411294 | 0.1746310 | 384 | 0.0633383 | 0.0961683 | 931 | 0.0697682 | 0.2331580 | 0 | NA | 0.0000000 |
| eQTL.dbslmm | 655 | 0.0429859 | 0.1638729 | 291 | 0.0664697 | 0.0728775 | 501 | 0.0630978 | 0.1254696 | 0 | NA | 0.0000000 |
| eQTL.sbayesr_robust | 621 | 0.0442815 | 0.1553665 | 126 | 0.0675922 | 0.0315552 | 239 | 0.0668272 | 0.0598547 | 0 | NA | 0.0000000 |
| eQTL.ldpred2 | 554 | 0.0384420 | 0.1386040 | 278 | 0.0557534 | 0.0696218 | 430 | 0.0555669 | 0.1076885 | 0 | NA | 0.0000000 |
| eQTL.sbayesr | 520 | 0.0361438 | 0.1300976 | 189 | 0.0540907 | 0.0473328 | 301 | 0.0512844 | 0.0753819 | 0 | NA | 0.0000000 |
| eQTL.top1 | 487 | 0.0255226 | 0.1218414 | 658 | 0.0539891 | 0.1647884 | 939 | 0.0545328 | 0.2351615 | 2177 | 0.0429274 | 0.5472599 |
The correlations are very low compared the R2 reported in the FUSION profile. This is true when using fusion or eQTL based models. Is this due to poor generalisablity across YFS and GTEx?
Show code
library(data.table)
# Read in the observed expression
obs<-fread('/users/k1806347/brc_scratch/Data/GTeX/v8/GTEx_Analysis_v8_eQTL_expression_matrices/Whole_Blood.v8.normalized_expression.bed.gz')
# Insert external_gene_name
obs<-obs[,-1:-3]
obs$gene_id<-gsub('\\..*','',obs$gene_id)
obs<-t(obs)
colnames(obs)<-obs[1,]
obs<-obs[-1,]
obs<-cbind(row.names(obs),obs)
colnames(obs)[1]<-'ID'
obs<-data.table(obs)
obs<-cbind(obs[,1],data.frame(lapply(obs[,-1], function(x) as.numeric(as.character(x)))))
# Read in covariates
covs<-fread('/users/k1806347/brc_scratch/Data/GTeX/v8/GTEx_Analysis_v8_eQTL_covariates/Whole_Blood.v8.covariates.txt')
covs<-t(covs)
colnames(covs)<-covs[1,]
covs<-covs[-1,]
covs<-cbind(row.names(covs),covs)
colnames(covs)[1]<-'ID'
covs<-data.table(covs)
covs<-cbind(covs[,1],data.frame(lapply(covs[,-1], function(x) as.numeric(as.character(x)))))
# Merge observed expression and covariates
obs<-obs[,!duplicated(names(obs)),with=F]
obs_covs<-merge(obs, covs, by='ID')
# Read in predicted expression
eqtl<-fread('/mnt/lustre/users/k1806347/Data/GTeX/v8/Zac/05072022/FeaturePredictions_eQTLGen.eQTL.txt.gz')
names(eqtl)<-gsub('.RDat','',gsub('eQTLGen.','', names(eqtl)))
# Identify genes available in fusion and eqtl based models
obs_genes<-names(obs)[-1]
eqtl_genes<-unique(gsub('\\..*','',gsub('eQTL.','',names(eqtl)[-1:-2])))
both_genes<-intersect(obs_genes, eqtl_genes)
# Residualise the covariates
obs_resid<-data.frame(ID=obs_covs$ID, stringsAsFactors=F)
for(i in both_genes){
obs_resid[[i]]<-as.numeric(scale(resid(lm(as.formula(paste0(i,' ~ ', paste(names(covs)[-1], collapse=' + '))), data=obs_covs))))
}
obs_resid<-data.table(obs_resid)
# Calculate correlation between observed expression and predicted expression from each method
cor_res<-list()
for(i in both_genes){
# Rename columns to make output consistent across genes and label the observed expression
pred_exp_tmp<-eqtl[,grepl(paste0('FID$|IID$|\\.',i,'\\.'), names(eqtl)), with=F]
names(pred_exp_tmp)<-gsub(paste0('.',i), '', names(pred_exp_tmp))
obs_exp_tmp<-obs_resid[,c('ID',i), with=F]
names(obs_exp_tmp)[names(obs_exp_tmp) == i]<-'obs'
# merge predicted and observed expression
both_exp_tmp<-merge(pred_exp_tmp, obs_exp_tmp, by.x='IID', by.y='ID')
both_exp_tmp$obs<-as.numeric(both_exp_tmp$obs)
# Calculate correlation
cor_res[[i]]<-cor(both_exp_tmp[,-1:-2,with=F], use='p')
}
# melt and combine all the results
cor_res_melt<-melt(cor_res)
cor_res_melt_obs<-cor_res_melt[cor_res_melt$Var1 == 'obs',]
cor_res_melt_obs<-cor_res_melt_obs[cor_res_melt_obs$Var1 != cor_res_melt_obs$Var2,]
cor_res_melt_obs$test<-paste0(cor_res_melt_obs$Var1,'_',cor_res_melt_obs$Var2)
write.table(cor_res_melt_obs, '/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v8_pred_exp/pred_obs_cor.txt', row.names=F, quote=F)
# cor_res_melt_obs<-fread('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v8_pred_exp/pred_obs_cor.txt')
cor_res_average<-NULL
for(i in unique(cor_res_melt_obs$test)){
cor_res_average<-rbind(cor_res_average, data.frame(test=i,
Mean=mean(cor_res_melt_obs$value[cor_res_melt_obs$test == i]),
N=sum(cor_res_melt_obs$test == i)))
}
library(ggplot2)
png('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v8_pred_exp/pred_obs_cor_boxplot_eQTLGen.png', units='px', width=1500, height=1000, res=300)
ggplot(cor_res_melt_obs, aes(x=paste0(Var1,' vs. ',Var2), y=value)) +
geom_boxplot() +
labs(x='Test', y='Correlation') +
coord_flip()
dev.off()
# Make a pairs plot
library("ggplot2")
library("GGally")
cor_res_melt_obs_unmelt<-dcast(data = cor_res_melt_obs,formula = L1~Var2,fun.aggregate = sum,value.var = "value")
# Set values that equal 0 to NA as these represent missing data
cor_res_melt_obs_unmelt[cor_res_melt_obs_unmelt == 0]<-NA
lowerfun <- function(data,mapping){
ggplot(data = data, mapping = mapping)+
geom_point() +
geom_abline(intercept =0 , slope = 1, colour='red') +
geom_vline(xintercept= 0, colour='blue') +
geom_hline(yintercept= 0, colour='blue') +
xlim(c(min(cor_res_melt_obs$value),max(cor_res_melt_obs$value)))+
ylim(c(min(cor_res_melt_obs$value),max(cor_res_melt_obs$value)))
}
diagfun <- function(data,mapping){
ggplot(data = data, mapping = mapping)+
geom_density() +
geom_vline(xintercept= 0, colour='blue') +
xlim(c(min(cor_res_melt_obs$value),max(cor_res_melt_obs$value)))
}
png('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v8_pred_exp/pred_obs_cor_pairsplot_eQTLGen.png', units='px', width=3000, height=3000, res=300)
ggpairs(cor_res_melt_obs_unmelt[,-1], lower = list(continuous = wrap(lowerfun)), diag = list(continuous = wrap(diagfun)))
dev.off()
# Count the number of genes with correlation > 0.1 for each method
n_valid<-NULL
for(i in unique(cor_res_melt_obs$Var2)){
n_valid<-rbind(n_valid, data.frame(Method=i,
N_valid=sum(cor_res_melt_obs$value[cor_res_melt_obs$Var2 == i] > 0.1, na.rm=T),
median_cor=median(cor_res_melt_obs$value[cor_res_melt_obs$Var2 == i], na.rm=T)))
}
n_valid<-n_valid[order(-n_valid$N_valid),]
n_valid$Freq_valid<-n_valid$N_valid/length(unique(cor_res_melt_obs$L1))
# Also count the number of times each method performed best
cor_res_melt_obs_top<-cor_res_melt_obs[order(-cor_res_melt_obs$value),]
cor_res_melt_obs_top<-cor_res_melt_obs_top[!is.na(cor_res_melt_obs_top$value),]
cor_res_melt_obs_top<-cor_res_melt_obs_top[!duplicated(cor_res_melt_obs_top$L1),]
n_valid$N_top<-NA
n_valid$median_top<-NA
for(i in n_valid$Method){
n_valid$N_top[n_valid$Method == i]<-sum(cor_res_melt_obs_top$test == paste0('obs_',i))
n_valid$median_top[n_valid$Method == i]<-median(cor_res_melt_obs_top$value[cor_res_melt_obs_top$test == paste0('obs_',i)])
}
n_valid$Prop_top<-n_valid$N_top/sum(n_valid$N_top)
write.csv(n_valid, '/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v8_pred_exp/n_valid_eQTLGen.csv', row.names=F)
Show correlation between observed and predicted expression
| Method | N_valid | median_cor | Freq_valid | N_top | Prop_top | median_top |
|---|---|---|---|---|---|---|
| eQTL.sbayesr_robust | 2418 | 0.0452008 | 0.1720017 | 1222 | 0.0869318 | 0.0795815 |
| eQTL.prscs | 2376 | 0.0435711 | 0.1690141 | 1793 | 0.1275521 | 0.0738549 |
| eQTL.ldpred2 | 2256 | 0.0413694 | 0.1604780 | 1978 | 0.1407128 | 0.0668918 |
| eQTL.lassosum | 2238 | 0.0403013 | 0.1591976 | 2361 | 0.1679590 | 0.0576601 |
| eQTL.dbslmm | 1984 | 0.0406331 | 0.1411296 | 1888 | 0.1343103 | 0.0590594 |
| eQTL.top1 | 1875 | 0.0321053 | 0.1333760 | 3271 | 0.2326955 | 0.0583043 |
| eQTL.sbayesr | 1372 | 0.0298598 | 0.0975957 | 1544 | 0.1098385 | 0.0521219 |
The predicted-observed correlation is looking better here, with 87% of genes achieving a correlation > 0.1. However, the order of the methods has changed a bit, with top1 and SBayesR (in robust mode) performing best. I am not going to read into the order of methods until we have run across all genes. Note. SuSiE does not converge due to LD reference mismatch leading to many genes not being imputed using SuSiE.
# Read in pred_obs_cor for YFS and eQTLGen
eqtlgen<-fread('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v8_pred_exp/pred_obs_cor.txt')
yfs<-fread('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v8_pred_exp/pred_obs_cor_YFS.txt')
# Subset YFS models from fusion
yfs<-yfs[grepl('fusion',yfs$Var2),]
# Compare number of (valid) genes from YFS and eQTLGen
length(unique(eqtlgen$L1)) # 14058
length(unique(yfs$L1)) # 3997
length(unique(eqtlgen$L1[eqtlgen$value > 0.1])) # 3453
length(unique(yfs$L1[yfs$value > 0.1])) # 1081
# Subset genes available in both fusion YFS and eQTLGen
library(biomaRt)
ensembl = useEnsembl(biomart="ensembl", dataset="hsapiens_gene_ensembl", GRCh=37)
Genes<-getBM(attributes=c('ensembl_gene_id','external_gene_name'), mart = ensembl)
eqtlgen<-merge(eqtlgen, Genes, by.x='L1', by.y='ensembl_gene_id')
eqtlgen$L1<-eqtlgen$external_gene_name
eqtlgen_subset<-eqtlgen[eqtlgen$L1 %in% yfs$L1,]
yfs_subset<-yfs[yfs$L1 %in% eqtlgen$L1,]
# Compare R for best eQTLGen model to best FUSION YFS model
eqtlgen_subset<-eqtlgen_subset[order(-eqtlgen_subset$value),]
eqtlgen_subset<-eqtlgen_subset[!is.na(eqtlgen_subset$value),]
eqtlgen_subset<-eqtlgen_subset[!duplicated(eqtlgen_subset$L1),]
yfs_subset<-yfs_subset[order(-yfs_subset$value),]
yfs_subset<-yfs_subset[!is.na(yfs_subset$value),]
yfs_subset<-yfs_subset[!duplicated(yfs_subset$L1),]
both_best<-merge(eqtlgen_subset, yfs_subset, by='L1')
nrow(both_best) # 3843
median(both_best$value.x) # 0.0732
median(both_best$value.y) # 0.062
library(ggplot2)
library(cowplot)
png('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v8_pred_exp/pred_obs_cor_eQTLGen_YFS_comp.png', units='px', width=1500, height=1500, res=300)
ggplot(both_best, aes(x=value.x, y=value.y)) +
geom_point() +
geom_abline(intercept = 0, slope = 1, colour='red') +
geom_vline(xintercept= 0, colour='blue') +
geom_hline(yintercept= 0, colour='blue') +
coord_fixed() +
theme_half_open() +
background_grid() +
labs(x="R (eQTLGen)", y="R (FUSION YFS)")
dev.off()
# Look at distribution of R for best model for genes not in YFS
eqtlgen_not_in_yfs<-eqtlgen[!(eqtlgen$L1 %in% yfs$L1),]
eqtlgen_not_in_yfs<-eqtlgen_not_in_yfs[!is.na(eqtlgen_not_in_yfs$value),]
eqtlgen_not_in_yfs<-eqtlgen_not_in_yfs[!duplicated(eqtlgen_not_in_yfs$L1),]
eqtlgen_not_in_yfs$YFS<-F
eqtlgen_subset$YFS<-T
eqtlgen_comp_plot<-rbind(eqtlgen_not_in_yfs, eqtlgen_subset)
png('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v8_pred_exp/present_in_YFS_hist.png', units='px', width=1500, height=1000, res=300)
ggplot(eqtlgen_comp_plot, aes(x=value, fill=YFS, colour=YFS)) +
geom_histogram(position="identity", alpha=0.5) +
theme_half_open() +
background_grid() +
labs(x='R', fill = 'In YFS', colour='In YFS')
dev.off()
Show eQTLGen and FUSION YFS comparison
Show code
# Prepare the eQTL summary statistics for each gene as if they were GWAS summary statistics
library(data.table)
# Read in eQTL sumstats
eqtl_sumstats<-fread('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/eQTLGen/eQTLGen_sumstats_exclGTeX.txt')
# Create list of genes
genes<-unique(eqtl_sumstats$GENE)
# For each gene
for(i in 1:length(genes)){
# Subset sumstats for gene i
eqtl_sumstats_i<-eqtl_sumstats[eqtl_sumstats$GENE == genes[i],]
# Munge for FUSION
eqtl_sumstats_i<-eqtl_sumstats_i[,c('SNP','A1','A2','Z','N'),with=F]
eqtl_sumstats_i<-eqtl_sumstats_i[complete.cases(eqtl_sumstats_i),]
eqtl_sumstats_i<-eqtl_sumstats_i[eqtl_sumstats_i$N >= 0.8*max(eqtl_sumstats_i$N),]
# Write out results
fwrite(eqtl_sumstats_i, paste0('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/pseudo/per_gene_eqtl/',genes[i],'.gz'), quote=F, sep='\t')
}
Adapt FUSION.assoc_test.R to run TWAS for specific SNP-weight. We will force it to use each model. If this approach works well we can streamline this process substantially.
# Now run TWAS for each gene using all models.
library(data.table)
# Read in list of eQTLGen SNP-weights
pos<-fread('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/eQTLGen/eQTLGen.eQTL.pos')
# For each gene
for(i in 1:nrow(pos)){
for(mod in c('top1','sbayesr','sbayesr_robust','dbslmm','prscs','susie')){
if(!file.exists(paste0('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/pseudo/pseudo_res/', pos$ID[i],'.',mod))){
system(paste0('/users/k1806347/brc_scratch/Software/Rscript.sh /scratch/users/k1806347/Software/MyGit/eQTL_to_TWAS/FUSION.assoc_test.pseudo.R --sumstats /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/pseudo/per_gene_eqtl/',pos$ID[i],'.gz --extract_weight ',pos$ID[i],' --weights /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/eQTLGen/eQTLGen.eQTL.pos --weights_dir /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/eQTLGen --ref_ld_chr /mnt/lustre/groups/biomarkers-brc-mh/TWAS_resource/FUSION/LDREF/1000G.EUR. --out /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/pseudo/pseudo_res/', pos$ID[i],'.',mod,' --chr ',pos$CHR[i],' --force_model ',mod))
}
}
}
# Read in the TWAS.Z results
pseudo_res<-NULL
for(i in 1:nrow(pos)){
for(mod in c('top1','sbayesr','sbayesr_robust','dbslmm','prscs','susie')){
res_i_mod<-fread(paste0('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/pseudo/pseudo_res/', pos$ID[i],'.',mod))
pseudo_res<-rbind(pseudo_res, data.frame(ID=pos$ID[i],
Model=mod,
TWAS.Z=res_i_mod$TWAS.Z))
}
}
# Read in the correlation between predicted and observed expression in GTEx v8.
cor_res_melt_obs<-fread('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v8_pred_exp/pred_obs_cor.txt')
cor_res_melt_obs<-cor_res_melt_obs[,c('L1','Var2','value'),with=F]
names(cor_res_melt_obs)<-c('ID','Model','R')
cor_res_melt_obs$Model<-gsub('eQTL.','',cor_res_melt_obs$Model)
obs_pseudo<-merge(pseudo_res, cor_res_melt_obs, by=c('ID','Model'))
# Determine the best model based on TWAS.Z and observed cor, and rank methods by TWAS.Z and observed cor for each gene.
genes<-unique(obs_pseudo$ID)
obs_pseudo_rank<-NULL
rank_diff<-NULL
n_top<-NULL
for(i in 1:length(genes)){
obs_pseudo_i<-obs_pseudo[obs_pseudo$ID == genes[i],]
obs_pseudo_i$pseudo_rank<-rank(-obs_pseudo_i$TWAS.Z)
obs_pseudo_i$obs_rank<-rank(-obs_pseudo_i$R)
obs_pseudo_rank<-rbind(obs_pseudo_rank, obs_pseudo_i)
rank_diff<-rbind(rank_diff, data.frame(ID=genes[i],
pseudo_top_mod=obs_pseudo_i$Model[obs_pseudo_i$pseudo_rank == min(obs_pseudo_i$pseudo_rank)][1],
pseudo_R = obs_pseudo_i$R[obs_pseudo_i$pseudo_rank == min(obs_pseudo_i$pseudo_rank)][1],
obs_top_mod=obs_pseudo_i$Model[obs_pseudo_i$obs_rank == min(obs_pseudo_i$obs_rank)][1],
top_R = obs_pseudo_i$R[obs_pseudo_i$obs_rank == min(obs_pseudo_i$obs_rank)][1]))
}
# Calculate correlation between model rank for each gene
cor(obs_pseudo_rank$pseudo_rank, obs_pseudo_rank$obs_rank) # 0.6105157
# Calculate the mean difference in R between the best model and the model selected by TWAS.Z
rank_diff$diff<-rank_diff$pseudo_R-rank_diff$top_R
mean(rank_diff$diff) # -0.006272514
# Calculate the percentage of genes where pseuodoval did not select the correct model
nrow(rank_diff[rank_diff$diff != 0,])/nrow(rank_diff) # 0.4155844
# Crosstab the selected observed and pseudoval-selected best model
saveRDS(table(rank_diff$pseudo_top_mod, rank_diff$obs_top_mod),'/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/pseudo/best_pseudo_comp.rds')
table(rank_diff$pseudo_top_mod, rank_diff$obs_top_mod)
table(rank_diff$pseudo_top_mod)
table(rank_diff$obs_top_mod)
rank_diff<-rank_diff[order(rank_diff$diff),]
library(ggplot2)
library(cowplot)
png('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/pseudo/best_pseudo_comp.png', res=300, units='px', width=1000, height=1000)
ggplot(rank_diff, aes(x=top_R, y=pseudo_R)) +
geom_abline(intercept = 0, slope = 1, colour='red') +
geom_point() +
coord_fixed() +
theme_half_open() +
xlim(c(0,NA)) +
ylim(c(0,NA)) +
background_grid() +
labs(x="R (Best)", y="R (Pseudovalidated)")
dev.off()
##########
# Identify genes with abnormally low correlation with observed when using SuSiE
##########
cor_res_melt_obs_susie<-cor_res_melt_obs[cor_res_melt_obs$Model == 'susie',]
# They have already been removed, probably due to zero standard deviation in predicted epxression. I am guessing there is a BETA of 0 for all variants in these cases.
genes[!(genes %in% cor_res_melt_obs_susie$ID)]
# Load the Rdat file for genes without SusiE predicted observed correlation
Comparison of model that performed best in GTEx to model that was selected by pseudovalidation
## top1_pseudo sbayesr_pseudo sbayesr_robust_pseudo
## top1_obs 39 0 7
## sbayesr_obs 2 0 0
## sbayesr_robust_obs 1 1 2
## dbslmm_obs 1 1 0
## prscs_obs 0 0 0
## susie_obs 1 0 0
## dbslmm_pseudo prscs_pseudo susie_pseudo
## top1_obs 6 2 4
## sbayesr_obs 1 0 0
## sbayesr_robust_obs 1 0 1
## dbslmm_obs 2 1 0
## prscs_obs 1 0 0
## susie_obs 0 1 2
These results are really encouraging. There is very little difference between the observed-predicted expression correlation for the best performing model and the model selected by pseudovalidation. There is some overfitting in the GTEx data as well since there is the best model is being selected in the GTEx sample. I am not sure there is any need to try testing MegaPRS pseudovalidation.
mkdir /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/pseudo2
for chr in $(seq 1 22);do
sbatch -p brc,shared --mem 10G /users/k1806347/brc_scratch/Software/Rscript.sh /scratch/users/k1806347/Software/MyGit/eQTL_to_TWAS/FUSION.assoc_test.pseudo_V2.R \
--sumstats /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/eQTLGen/eQTLGen_sumstats_exclGTeX.txt \
--weights /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/eQTLGenall/eQTLGen.eQTL.pos \
--weights_dir /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/eQTLGenall \
--ref_ld_chr /mnt/lustre/groups/biomarkers-brc-mh/TWAS_resource/FUSION/LDREF/1000G.EUR. \
--out /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/pseudo2/pseudo_res_chr${chr} \
--chr ${chr} \
--all_mod T
done
library(data.table)
# Read in pseudoval results
pseudo_res_2<-NULL
for(i in 1:22){
pseudo_res_2<-rbind(pseudo_res_2, fread(paste0('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/pseudo2/pseudo_res_chr',i)))
}
names(pseudo_res_2)[names(pseudo_res_2) == 'MODEL']<-'Model'
pseudo_res_2<-pseudo_res_2[,c('ID','Model','TWAS.Z'), with=F]
pos<-fread('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/eQTLGenall/eQTLGen.eQTL.pos')
pos2<-fread('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/eQTLGen/eQTLGen.eQTL.pos')
pseudo_res<-NULL
for(i in 1:nrow(pos2)){
for(mod in c('top1','sbayesr','sbayesr_robust','dbslmm','prscs','susie')){
res_i_mod<-fread(paste0('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/pseudo/pseudo_res/', pos2$ID[i],'.',mod))
pseudo_res<-rbind(pseudo_res, data.frame(ID=pos2$ID[i],
Model=mod,
TWAS.Z=res_i_mod$TWAS.Z))
}
}
pseudo_comp<-merge(pseudo_res, pseudo_res_2, by=c('ID','Model'))
# New ad old pseudoval scripts correlate perfectly.
# Read in the correlation between predicted and observed expression in GTEx v8.
cor_res_melt_obs<-fread('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/GTEx_v8_pred_exp/pred_obs_cor.txt')
cor_res_melt_obs<-cor_res_melt_obs[,c('L1','Var2','value'),with=F]
names(cor_res_melt_obs)<-c('ID','Model','R')
cor_res_melt_obs$Model<-gsub('eQTL.','',cor_res_melt_obs$Model)
obs_pseudo<-merge(pseudo_res_2, cor_res_melt_obs, by=c('ID','Model'))
# Determine the best model based on TWAS.Z and observed cor, and rank methods by TWAS.Z and observed cor for each gene.
genes<-unique(obs_pseudo$ID)
obs_pseudo_rank<-NULL
rank_diff<-NULL
n_top<-NULL
for(i in 1:length(genes)){
obs_pseudo_i<-obs_pseudo[obs_pseudo$ID == genes[i],]
obs_pseudo_i$pseudo_rank<-rank(-obs_pseudo_i$TWAS.Z)
obs_pseudo_i$obs_rank<-rank(-obs_pseudo_i$R)
obs_pseudo_rank<-rbind(obs_pseudo_rank, obs_pseudo_i)
rank_diff<-rbind(rank_diff, data.frame(ID=genes[i],
pseudo_top_mod=obs_pseudo_i$Model[obs_pseudo_i$pseudo_rank == min(obs_pseudo_i$pseudo_rank)][1],
pseudo_R = obs_pseudo_i$R[obs_pseudo_i$pseudo_rank == min(obs_pseudo_i$pseudo_rank)][1],
obs_top_mod=obs_pseudo_i$Model[obs_pseudo_i$obs_rank == min(obs_pseudo_i$obs_rank)][1],
top_R = obs_pseudo_i$R[obs_pseudo_i$obs_rank == min(obs_pseudo_i$obs_rank)][1]))
}
# Calculate correlation between model rank for each gene
cor(obs_pseudo_rank$pseudo_rank, obs_pseudo_rank$obs_rank) # 0.08380318
# Calculate the mean difference in R between the best model and the model selected by TWAS.Z
rank_diff$diff<-rank_diff$pseudo_R-rank_diff$top_R
mean(rank_diff$diff, na.rm=T) # -0.02506432
# Calculate the percentage of genes where pseuodoval did not select the correct model
nrow(rank_diff[rank_diff$diff != 0,])/nrow(rank_diff) # 0.8415651
# Crosstab the selected observed and pseudoval-selected best model
saveRDS(table(rank_diff$pseudo_top_mod, rank_diff$obs_top_mod),'/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/pseudo2/best_pseudo_comp_eQTLGen.rds')
table(rank_diff$pseudo_top_mod, rank_diff$obs_top_mod)
table(rank_diff$pseudo_top_mod)
table(rank_diff$obs_top_mod)
rank_diff<-rank_diff[order(rank_diff$diff),]
library(ggplot2)
library(cowplot)
png('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/pseudo2/best_pseudo_comp_eQTLGen.png', res=300, units='px', width=1000, height=1000)
ggplot(rank_diff, aes(x=top_R, y=pseudo_R)) +
geom_abline(intercept = 0, slope = 1, colour='red') +
geom_point() +
coord_fixed() +
theme_half_open() +
background_grid() +
labs(x="R (Best)", y="R (Pseudovalidated)")
dev.off()
png('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/pseudo2/best_pseudo_comp_eQTLGen_hist.png', res=300, units='px', width=1500, height=1000)
ggplot(rank_diff, aes(x=diff)) +
geom_histogram() +
theme_half_open() +
background_grid() +
labs(x="Difference in R")
dev.off()
Comparison of model that performed best in GTEx to model that was selected by pseudovalidation
## Loading required package: lattice
## Loading required package: ggplot2
## Warning: replacing previous import 'lifecycle::last_warnings' by
## 'rlang::last_warnings' when loading 'tibble'
## Confusion Matrix and Statistics
##
##
## dbslmm lassosum prscs sbayesr sbayesr_robust ldpred2 top1
## dbslmm 715 803 425 450 311 529 1054
## lassosum 1011 1324 1231 935 809 1275 1906
## prscs 3 8 3 3 10 6 7
## sbayesr 113 183 101 133 75 111 214
## sbayesr_robust 23 18 16 5 9 16 19
## ldpred2 19 26 12 10 6 36 54
## top1 1 4 2 3 0 1 3
##
## Overall Statistics
##
## Accuracy : 0.1584
## 95% CI : (0.1524, 0.1646)
## No Information Rate : 0.2321
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.0061
##
## Mcnemar's Test P-Value : <2e-16
##
## Statistics by Class:
##
## Class: dbslmm Class: lassosum Class: prscs Class: sbayesr
## Sensitivity 0.37931 0.55959 0.0016760 0.086420
## Specificity 0.70591 0.38560 0.9969774 0.936199
## Pos Pred Value 0.16678 0.15593 0.0750000 0.143011
## Neg Pred Value 0.87993 0.81191 0.8722750 0.892680
## Prevalence 0.13435 0.16863 0.1275747 0.109686
## Detection Rate 0.05096 0.09436 0.0002138 0.009479
## Detection Prevalence 0.30554 0.60516 0.0028508 0.066282
## Balanced Accuracy 0.54261 0.47260 0.4993267 0.511309
## Class: sbayesr_robust Class: ldpred2 Class: top1
## Sensitivity 0.0073770 0.018237 0.0009211
## Specificity 0.9924284 0.989467 0.9989790
## Pos Pred Value 0.0849057 0.220859 0.2142857
## Neg Pred Value 0.9130341 0.860254 0.7678533
## Prevalence 0.0869503 0.140688 0.2321289
## Detection Rate 0.0006414 0.002566 0.0002138
## Detection Prevalence 0.0075547 0.011617 0.0009978
## Balanced Accuracy 0.4999027 0.503852 0.4999501
mkdir /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/pseudo_fusion_yfs
for chr in $(seq 1 22);do
sbatch -p brc,shared --exclude=nodeb11,nodeb16,nodeb09,nodeb18 --mem 10G /users/k1806347/brc_scratch/Software/Rscript.sh /scratch/users/k1806347/Software/MyGit/eQTL_to_TWAS/FUSION.assoc_test.pseudo_V2.R \
--sumstats /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/YFS/YFS_eQTL_from_fusion.txt \
--weights /mnt/lustre/groups/biomarkers-brc-mh/TWAS_resource/FUSION/SNP-weights/YFS.BLOOD.RNAARR/YFS.BLOOD.RNAARR.pos \
--weights_dir /mnt/lustre/groups/biomarkers-brc-mh/TWAS_resource/FUSION/SNP-weights/YFS.BLOOD.RNAARR \
--ref_ld_chr /mnt/lustre/groups/biomarkers-brc-mh/TWAS_resource/FUSION/LDREF/1000G.EUR. \
--out /scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/pseudo_fusion_yfs/pseudo_res_chr${chr} \
--chr ${chr} \
--all_mod T
done
library(data.table)
# Read in pseudoval results
pseudo_res_2<-NULL
for(i in 1:22){
pseudo_res_2<-rbind(pseudo_res_2, fread(paste0('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/pseudo_fusion_yfs/pseudo_res_chr',i)))
}
# Determine the best model based on TWAS.Z and observed cor, and rank methods by TWAS.Z and observed cor for each gene.
genes<-unique(pseudo_res_2$ID)
obs_pseudo_rank<-NULL
rank_diff<-NULL
n_top<-NULL
for(i in 1:length(genes)){
obs_pseudo_i<-pseudo_res_2[pseudo_res_2$ID == genes[i],]
obs_pseudo_i$pseudo_rank<-rank(-obs_pseudo_i$TWAS.Z)
obs_pseudo_i$obs_rank<-rank(-obs_pseudo_i$MODELCV.R2)
obs_pseudo_rank<-rbind(obs_pseudo_rank, obs_pseudo_i)
rank_diff<-rbind(rank_diff, data.frame(ID=genes[i],
pseudo_top_mod=obs_pseudo_i$MODEL[obs_pseudo_i$pseudo_rank == min(obs_pseudo_i$pseudo_rank)][1],
pseudo_R = sqrt(obs_pseudo_i$MODELCV.R2[obs_pseudo_i$pseudo_rank == min(obs_pseudo_i$pseudo_rank)][1]),
obs_top_mod=obs_pseudo_i$MODEL[obs_pseudo_i$obs_rank == min(obs_pseudo_i$obs_rank)][1],
top_R = sqrt(obs_pseudo_i$MODELCV.R2[obs_pseudo_i$obs_rank == min(obs_pseudo_i$obs_rank)][1])))
}
# Calculate correlation between model rank for each gene
cor(obs_pseudo_rank$pseudo_rank, obs_pseudo_rank$obs_rank) # 0.0623
# Calculate the mean difference in R between the best model and the model selected by TWAS.Z
rank_diff$diff<-rank_diff$pseudo_R-rank_diff$top_R
mean(rank_diff$diff, na.rm=T) # -0.024
# Calculate the percentage of genes where pseuodoval did not select the correct model
nrow(rank_diff[rank_diff$diff != 0,])/nrow(rank_diff) # 0.827
# Crosstab the selected observed and pseudoval-selected best model
saveRDS(table(rank_diff$pseudo_top_mod, rank_diff$obs_top_mod),'/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/pseudo_fusion_yfs/best_pseudo_comp.rds')
table(rank_diff$pseudo_top_mod, rank_diff$obs_top_mod)
table(rank_diff$pseudo_top_mod)
table(rank_diff$obs_top_mod)
rank_diff<-rank_diff[order(rank_diff$diff),]
library(ggplot2)
library(cowplot)
png('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/pseudo_fusion_yfs/best_pseudo_comp.png', res=300, units='px', width=1000, height=1000)
ggplot(rank_diff, aes(x=top_R, y=pseudo_R)) +
geom_point() +
geom_abline(intercept = 0, slope = 1, colour='red') +
coord_fixed() +
theme_half_open() +
background_grid() +
labs(x="R (Best)", y="R (Pseudovalidated)")
dev.off()
png('/scratch/groups/biomarkers-brc-mh/TWAS_resource/eQTL_to_TWAS/data/pseudo_fusion_yfs/best_pseudo_comp_hist.png', res=300, units='px', width=1500, height=1000)
ggplot(rank_diff, aes(x=diff)) +
geom_histogram() +
theme_half_open() +
background_grid() +
labs(x="Difference in R")
dev.off()
Comparison of model that performed best in YFS to model that was selected by pseudovalidation
## Confusion Matrix and Statistics
##
##
## enet bslmm blup lasso
## enet 186 155 103 65
## bslmm 130 223 206 94
## blup 590 1184 215 803
## lasso 1 1 0 2
##
## Overall Statistics
##
## Accuracy : 0.1582
## 95% CI : (0.1469, 0.1699)
## No Information Rate : 0.3949
## P-Value [Acc > NIR] : 1
##
## Kappa : -0.0371
##
## Mcnemar's Test P-Value : <2e-16
##
## Statistics by Class:
##
## Class: enet Class: bslmm Class: blup Class: lasso
## Sensitivity 0.20507 0.14267 0.41031 0.0020747
## Specificity 0.89413 0.82046 0.24956 0.9993320
## Pos Pred Value 0.36542 0.34150 0.07701 0.5000000
## Neg Pred Value 0.79095 0.59455 0.73499 0.7567021
## Prevalence 0.22916 0.39490 0.13239 0.2435574
## Detection Rate 0.04699 0.05634 0.05432 0.0005053
## Detection Prevalence 0.12860 0.16498 0.70541 0.0010106
## Balanced Accuracy 0.54960 0.48157 0.32993 0.5007033
The variance explained by the best model identified by pseudovalidation is very similar to the best model identified by cross validation, however the pseudoval model is often not the model identified by cross validation.